# 4.8: Using Excel and R to Analyze Data

• • Contributed by David Harvey
• Professor (Chemistry and Biochemistry) at DePauw University

Although the calculations in this chapter are relatively straightforward, it can be tedious to work problems using nothing more than a calculator. Both Excel and R include functions for descriptive statistics, for finding probabilities for different distributions, and for carrying out significance tests. In addition, R provides useful functions for visualizing your data.

## 4.8.1 Excel

Excel provides two methods for working with data: built-in functions for individual statistical parameters and a package of data analysis tools in the Analysis ToolPak. The ToolPak is not a standard part of Excel’s instillation. To see if you have access to the Analysis ToolPak on your computer, select Tools from the menu bar and look for the Data Analysis... option. If you do not see Data Analysis..., select Add-ins... from the Tools menu. Check the box for the Analysis ToolPak and click on OK to install them. (Once you install the Analysis ToolPak, it will continue to load each time you launch Excel.)

### Descriptive Statistics

Let’s use Excel to provide a statistical summary of the data in Table 4.1. Enter the data into a spreadsheet, as shown in Figure 4.18. Select Data Analysis... from the Tools menu, which opens a window entitled “Data Analysis.” Scroll through the window, select Descriptive Statistics from the available options, and click OK. Place the cursor in the box for the “Input Range” and then click and drag over the cells B1:B8. Check the box for “Labels in the first row.” Select the radio button for “Output range,” place the cursor in the box and click on an empty cell; this is where Excel will place the results. Check the boxes for “Summary statistics” and for the “Confidence level for the mean.” Accept the default value of 95% for the confidence level. Clicking OK generates the information shown in Figure 4.19. Figure 4.18 Portion of a spreadsheet containing data from Table 4.1. Figure 4.19 Output from Excel’s Descriptive Statistics command in the Analysis TookPak. Note that Excel does not adjust for significant figures. The mode is the most common result, which is not relevant here. Kurtosis is a measure of the “peakedness” of the data’s distribution, and is zero for a normal distribution. Skewness is a measure of the symmetry of the data’s distribution and is zero for a normal distribution. For a small sample size—such as the seven samples in this data set—skewness and kurtosis are not particularly useful. You may consult the textbooks listed in the Additional Resources for more information about kurtosis and skewness.

The Descriptive Statistics command provides a table of values for a sample. If your interest is in a statistical summary for a population, or if you do not want a summary of all descriptive statistics, you will need to use Excel’s built-in functions (Table 4.19). To use a function, click on an empty cell, enter the formula, and press Return or Enter. To find, for example, the population variance for the data in Figure 4.18, place the cursor in an empty cell and enter the formula

=varp(b2:b8)

The contents of the cell are replaced with Excel’s exact calculation of the population’s variance (0.002 221 918).

Table 4.19 Excel Functions for Descriptive Statistics
Parameter Excel Function
mean =average(number1,number2,...)
median =median(number1,number2,...)
sample standard deviation =stdev(number1,number2,...)
population standard deviation =stdevp(number1,number2,...)
sample variance =var(number1,number2,...)
population variance =varp(number1,number2,...)
range =max((number1,number2,...) – min(number1,number2,...)

### Probability Distributions

In Example 4.11 we showed that 91.10% of a manufacturer’s analgesic tablets contain between 243 and 262 mg of aspirin. We obtained this result by calculating the deviation, z, of each limit from the population’s expected mean, μ, of 250 mg in terms of the population’s expected standard deviation, σ, of 5 mg. After calculating values for z, we used the table in Appendix 3 to find the area under the normal distribution curve between the two limits.

We can complete this calculation in Excel using the built-in normdist function. The function’s general format is

=normdist(x, μ, σ, TRUE)

where x is the limit of interest. The function returns the probability of obtaining a result of less than x from a normal distribution with a mean of μ and a standard deviation of σ (Figure 4.20). To solve Example 4.11 using Excel enter the following formulas into separate cells

=normdist(243, 250, 5, TRUE)

=normdist(262, 250, 5, TRUE)

obtaining results of 0.080 756 659 and 0.991 802 464. Subtract the smaller value from the larger value and adjust to the correct number of significant figures to arrive at a probability of 0.9910, or 99.10%. Figure 4.20 Shown in blue is the area returned by the function =normdist(x, μ, σ, TRUE)

Excel also includes a function for binomial distributions. The function’s format is

=binomdist(X, N, p, T/F)

where X is the number of times a particular outcome occurs in N trials, and p is the probability of X occurring in one trial. Setting the function’s last term to TRUE gives the total probability for any result up to X and setting it to FALSE to give the probability for X. Using Example 4.10 to test this function, we find that the probability of finding no atoms of 13C atoms in a molecule of cholesterol, C27H44O) using the formula

=binomdist(0, 27, 0.0111, FALSE)

which returns a value of 0.740 as an answer, after adjusting the significant figures. Using the formula

=binomdist(2, 27, 0.0111, TRUE)

we find that 99.7% of cholesterol molecules contain two or fewer atoms of 13C.

### Significance Tests

Excel’s Analysis ToolPak includes tools to help you complete the following significance tests covered in this chapter:

• an F-test of variances
• an unpaired t-test of sample means assuming equal variances
• an unpaired t-test of sample means assuming unequal variances
• a paired t-test for of sample means

Let's use the ToolPak to complete a t-test on the data in Table 4.11, which contains results for two experiments to determine the mass of a circulating U. S. penny. Enter the data from Table 4.11 into a spreadsheet as shown in Figure 4.21. Because the data in this case are unpaired, we will use Excel to complete an unpaired t-test. Before we can complete a t-test we must use an F-test to determine whether the variances for the two data sets are equal or unequal. Our null hypothesis is that the variances are equal, sSet12 = sSet22, and our alternative hypothesis is that the variances are not equal, sSet12sSet22. Figure 4.21 Portion of a spreadsheet containing the data in Table 4.11.

To complete the F-test select Data Analysis... from the Tools menu, which opens a window entitled “Data Analysis.” Scroll through the window, select F-Test Two Sample Variance from the available options, and click OK. Place the cursor in the box for the “Variable 1 range” and then click and drag over the cells B1:B8. Next, place the cursor in the box for “Variable 2 range” and then click and drag over the cells B1:B6. Check the box for “Labels.” In the box for “Alpha” enter your value for α. Excel’s F-test is a one-tailed test, so if you wish to complete a two-tailed test, as we do in this example, divide your α value by 2. We want our F-test to use an α of 0.05, so we enter 0.025 in the box for “Alpha.” Select the radio button for “Output range,” place the cursor in the box and click on an empty cell; this is where Excel will place the results. Clicking OK generates the information shown in Figure 4.22. Because Fexp, which is 1.87, is less than F(0.05, 6, 4), which is 9.20, we retain the null hypothesis and have no evidence of a difference between the variances at an α of 0.05. (See Example 4.18 for our earlier solution to this problem.) Figure 4.22 Output from Excel’s F-test in the Analysis TookPak. Note that results are for a one-tailed F-test. The value for “P(F<=f) one-tail” is the probability of incorrectly rejecting the null hypothesis of equal variances, and the value for “F Critical one-tail” is equivalent to a one-tailed F(0.025, 6, 4). For a two-tailed F-test, the probability of incorrectly rejecting the null hypothesis is 2 × P(F<=f), or 0.566, and F Critical is for F(0.05, 6, 4). Note: Excel assumes that the variance for variable 1 is greater than the variance for variable 2; thus, you must assign variable 1 to the data set with the larger of the two variances in order to correctly interpret the F-test.

Having found no evidence suggesting unequal variances, we now complete an unpaired t-test assuming equal variances. Our null hypothesis is that there is no difference between the means, XSet 1 = XSet 2, and our alternative hypothesis is that there is a difference between the means,XSet 1XSet 2. To complete the t-test select Data Analysis... from the Tools menu, which opens a window entitled "Data Analysis." Scroll through the window, select t-Test Two Sample Assuming Equal Variances from the available options, and click OK. Place the cursor in the box for the "Variable 1 range" and then click and drag over the cells B1:B8. Next, place the cursor in the box for "Variable 2 range" and then click and drag over the cells B1:B6. In the box for "Hypothesized mean difference," enter a value of 0. Check the box for "Labels." In the box for "Alpha" enter your value for α. Select the radio button for "Output range," place the cursor in the box and click on an empty cell; this is where Excel will place the results. Clicking OK generates the information shown in Figure 4.23. Because texp, which is 1.33, is less than t(0.05, 10), which is 2.23, we retain the null hypothesis and have no evidence of a difference between the means at an α of 0.05. (See Example 4.19 for our earlier solution
to this problem.) Figure 4.23 Output from Excel’s t-test assuming equal variances in the Analysis ToolPak. The absolute value of “t Stat” is equivalent to texp (note: because Excel subtracts the mean for variable 2 from the mean for variable 1, t Stat may be positive or negative). The values of “P(T<=t) one-tail” and of “P(T<=t) two-tail” are the probabilities of incorrectly rejecting the null hypothesis for a one-tailed or for a two-tailed t-test. The values for “t Critical one-tail” and for “t Critical two tail” are t(0.05, 10) for a one-tailed and two-tailed t-test, respectively.

The other significance tests in Excel work in the same format. The following practice exercise provides you with an opportunity to test yourself.

Practice Exercise 4.12

Rework Example 4.20 and Example 4.21 using Excel’s Analysis Tool-Pak.

## 4.8.2 R

R is a programming environment that provides powerful capabilities for analyzing data. There are many functions built into R’s standard installation and additional packages of functions are available from the R web site (www.r-project.org). Commands in R are not available from pull down menus. Instead, you interact with R by typing in commands.

### Descriptive Statistics

Let’s use R to provide a statistical summary of the data in Table 4.1. To do this we first need to create an object containing the data, which we do by typing in the following command.

> penny1=c(3.080, 3.094, 3.107, 3.056, 3.112, 3.174, 3.198)

Note

In R, the symbol ‘>’ is a prompt indicating that the program is waiting for you to enter a command. When you press ‘Return’ or ‘Enter,’ R executes the command.

Table 4.20 lists the commands in R for calculating basic descriptive statistics. As is true for Excel, R does not include stand alone commands for all statistics of interest, but we can calculate them using other commands. Using a command is easy—simply enter the appropriate code; for example, to find the sample’s variance enter the appropriate command at the prompt.

> var(penny1)
 0.002221918

Table 4.20 R Functions for Descriptive Statistics
Parameter R Function
mean mean(object)
median median(object)
sample standard deviation sd(object)
population standard deviation sd(object)*((length(object)-1)/length(object))^0.5
sample variance var(object)
population variance var(object)*((length(object)-1)/length(object))
range max(object)-min(object)

### Probability Distributions

In Example 4.11 we showed that 91.10% of a manufacturer’s analgesic tables contain between 243 and 262 mg of aspirin. We obtained this result by calculating the deviation, z, of each limit from the population’s expected mean, μ, of 250 mg in terms of the population’s expected standard deviation, σ, of 5 mg. After calculating values for z, we used the table in Appendix 3 to find the area under the normal distribution curve between the two limits.

We can complete this calculation in R using the function pnorm. The function’s general format is

pnorm(x, μ, σ)

where x is the limit of interest, μ is the distribution’s expected mean and σ is the distribution’s expected standard deviation. The function returns the probability of obtaining a result of less than x (Figure 4.24). Here is the output of an R session for solving Example 4.11.

> pnorm(243,250,5)
 0.08075666
> pnorm(262,250,5)
 0.9918025

Subtracting the smaller value from the larger value and adjusting to the correct number of significant figures gives the probability as 0.9910, or 99.10%. Figure 4.24 Shown in blue is the area returned by the function pnorm(x, μ, σ)

R also includes functions for binomial distributions. To find the probability of obtaining a particular outcome, X, in N trials we use the dbinom function.

dbinom(X, N, p)

where p is the probability of X occurring in one trial. Using Example 4.10 to test this function, we find that the probability of finding no atoms of 13C atoms in a molecule of cholesterol (C27H44O) is

> dbinom(0,27,0.0111)
 0.7397997

0.740 after adjusting the significant figures. To find the probability of obtaining any outcome up to a maximum value of X, we use the pbinom function.

pbinom(X, N, p)

To find the percentage of cholesterol molecules containing 0, 1, or 2 atoms of 13C, we enter

> pbinom(2,27,0.0111)
 0.9967226

and find that the answer is 99.7% of cholesterol molecules.

### Significance Tests

R includes commands to help you complete the following significance tests covered in this chapter:

• an F-test of variances
• an unpaired t-test of sample means assuming equal variances
• an unpaired t-test of sample means assuming unequal variances
• a paired t-test for of sample means
• Dixon’s Q-test for outliers
• Grubb’s test for outliers

Let’s use R to complete a t-test on the data in Table 4.11, which contains results for two experiments to determine the mass of a circulating U. S. penny. To do this, enter the data from Table 4.11 into two objects.

> penny1=c(3.080, 3.094, 3.107, 3.056, 3.112, 3.174, 3.198)
> penny2=c(3.052, 3.141, 3.083, 3.083, 3.048)

Because the data in this case are unpaired, we will use R to complete an unpaired t-test. Before we can complete a t-test we must use an F-test to determine whether the variances for the two data sets are equal or unequal. Our null hypothesis is that the variances are equal, sSet12 = sSet22, and our alternative hypothesis is that the variances are not equal, sSet12sSet22.

The command for a two-tailed F-test in R, which is our choice for this problem, is

var.test(X, Y)

where X and Y are the objects containing the data sets. Figure 4.25 shows the output from an R session to solve this problem.

Note

For a one-tailed F-test the command is one of the following

var.test(X, Y, alternative = “greater”)
var.test(X, Y, alternative = “less”)

where “greater” is used when the alternative hypothesis is sX2 > sY2, and “less” is used when the alternative hypothesis is sX2 < sY2. Figure 4.25 Output of an R session for an F-test of variances. The p-value of 0.5661 is the probability of incorrectly rejecting the null hypothesis that the variances are equal (note: R identifies the value α as a p-value). The 95% confidence interval is the range of values for Fexp that can be explained by random error. If this range includes the expected value for F, in this case 1.00, then there is insufficient evidence to reject the null hypothesis. Note that R does not adjust for significant figures.

(R calculates Fexp as (sX)2/(sY)2. If we use the command var.test(penny2, penny1) the output will give R as 0.534 and the 95% confidence interval as 0.0858 to 4.912. Because the expected value for Fexp of 1.00 falls within the confidence interval, we retain the null hypothesis of equal variances.)

R does not provide the critical value for F(0.05, 6, 4). Instead it reports the 95% confidence interval for Fexp. Because this confidence interval of 0.204 to 11.661 includes the expected value for F of 1.00, we retain the null hypothesis and have no evidence for a difference between the variances. R also provides the probability of incorrectly rejecting the null hypothesis, which in this case is 0.5561.

Having found no evidence suggesting unequal variances, we now complete an unpaired t-test assuming equal variances. Our null hypothesis is that there is no difference between the means, XSet1 = XSet2, and our alternative hypothesis is that there is a difference between the means, XSet1XSet2. In R there is a single command for all two-sample t-tests. The basic syntax for a two-tailed unpaired t-test with unequal variances is

t.test(X, Y, mu = 0, paired = FALSE, var.equal = FALSE)

where X and Y are the objects containing the data sets. You can change the underlined terms to alter the nature of the t-test. Replacing “var.equal = FALSE” to “var.equal = TRUE” makes this a two-tailed t-test with equal variances, and replacing “paired = FALSE” with “paired = TRUE” makes this a paired t-test. The term “mu = 0” is the expected difference between the means, which for a null hypothesis of XSet1 = XSet2 is 0. You can, of course, change this to suit your needs. The underlined terms are default values; if you omit them, then R assumes that you intend an unpaired two-tailed t-test of the null hypothesis that X = Y with unequal variances. Figure 4.26 shows the output of an R session for this problem. Figure 4.26 Output of an R session for an unpaired t-test with equal variances. The p-value of 0.2116 is the probability of incorrectly rejecting the null hypothesis that the means are equal (note: R identifies the value a as a p-value). The 95% confidence interval is the range of values for the difference between the means that can be explained by random error. If this range includes the expected value for the difference, in this case zero, then there is insufficient evidence to reject the null hypothesis. Note that R does not adjust for significant figures.

Note

To complete a one-sided t-test, include the command

alternative = “greater”

or

alternative = “less”

A one-sided paired t-test that the difference between two samples is greater than 0 becomes

t.test(X, Y, paired = TRUE, alternative = “greater”)

The p-value of 0.2116 means that there is a 21.16% probability of incorrectly rejecting the null hypothesis. The 95% confidence interval of -0.024 to 0.0958, which is for the difference between the sample means, includes the expected value of zero. Both ways of looking at the results of the t-test provide no evidence for rejecting the null hypothesis; thus, we retain the null hypothesis and find no evidence for a difference between the two samples.

Practice Exercise 4.13

Rework Example 4.20 and Example 4.21 using R.

Unlike Excel, R also includes functions for evaluating outliers. These functions are not part of R’s standard installation. To install them enter the following command within R (note: you will need an internet connection to download the package of functions).

> install.packages(“outliers”)

After installing the package, you will need to load the functions into R using the following command (note: you will need to do this step each time you begin a new R session as the package does not automatically load when you start R).

> library(“outliers”)

Note

You need to install a package once, but you need to load the package each time you plan to use it. There are ways to configure R so that it automatically loads certain packages; see An Introduction to R for more information (click here to view a PDF version of this document).

Let’s use this package to find the outlier in Table 4.16 using both Dixon’s Q-test and Grubb’s test. The commands for these tests are

dixon.test(X, type = 10, two.sided = TRUE)
grubbs.test(X, type = 10, two.sided = TRUE)

where X is the object containing the data, “type = 10” specifies that we are looking for one outlier, and “two.sided=TRUE” indicates that we are using the more conservative two-tailed test. Both tests have other variants that allow the testing of outliers on both ends of the data set (“type = 11”) or for more than one outlier (“type = 20”), but we will not consider these. Figure 4.27 shows the output of a session for this problem. For both tests the very small p-value indicates that we can treat as an outlier the penny with a mass of 2.514 g. Figure 4.27 Output of an R session for Dixon’s Q-test and Grubb’s test for outliers. The p-values for both tests show that we can treat as an outlier the penny with a mass of 2.514 g.

### Visualizing Data

One of the more useful features of R is the ability to visualize your data. Visualizing your data is important (a point we will return to in Chapter 5 when we consider the mathematical modeling of data) because it provides you with an intuitive feel for your data that can help you in applying and evaluating statistical tests. It is tempting to believe that a statistical analysis is foolproof, particularly if the probability for incorrectly rejecting the null hypothesis is small. Looking at a visual display of your data, however, can help you determine whether your data is normally distributed—a requirement for most of the significance tests in this chapter—and can help you identify potential outliers. There are many useful ways to look at your data, four of which we consider here. Figure 4.28 Four different ways to plot the data in Table 4.13: (a) histogram; (b) kernel density plot showing smoothed distribution and individual data points; (c) dot chart; and (d) box plot.

To plot data in R we will use the package “lattice,” which you will need to load using the following command.

> library(“lattice”)

To demonstrate the types of plots we can generate, we will use the object “penny,” which contains the masses of the 100 pennies in Table 4.13.

Note

You do not need to use the command install.package this time because lattice was automatically installed on your computer when you downloaded R.

Our first display is a histogram. To construct the histogram we use mass to divide the pennies into bins and plot the number of pennies or the percent of pennies in each bin on the y-axis as a function of mass on the x-axis. Figure 4.28a shows the result of entering

> histogram(penny, type = “percent”, xlab = “Mass (g)”, ylab = “Percent of Pennies”, main = “Histogram of Data in Table 4.13”)

Note

To create a histogram showing the number of pennies in each bin, change “percent” to “count.”

A histogram allows us to visualize the data’s distribution. In this example the data appear to follow a normal distribution, although the largest bin does not include the mean of 3.095 g and the distribution is not perfectly symmetric. One limitation of a histogram is that its appearance depends on how we choose to bin the data. Increasing the number of bins and centering the bins around the data’s mean gives a histogram that more closely approximates a normal distribution (Figure 4.10).

An alternative to the histogram is a kernel density plot, which is basically a smoothed histogram. In this plot each value in the data set is replaced with a normal distribution curve whose width is a function of the data set’s standard deviation and size. The resulting curve is a summation of the individual distributions. Figure 4.28b shows the result of entering the command

> densityplot(penny, xlab = “Mass of Pennies (g)”, main = “Kernel Density Plot of Data in Table 4.13”)

The circles at the bottom of the plot show the mass of each penny in the data set. This display provides a more convincing picture that the data in Table 4.13 are normally distributed, although we can see evidence of a small clustering of pennies with a mass of approximately 3.06 g.

We analyze samples to characterize the parent population. To reach a meaningful conclusion about a population, the samples must be representative of the population. One important requirement is that the samples must be random. A dot chart provides a simple visual display that allows us look for non-random trends. Figure 4.28c shows the result of entering

> dotchart(penny, xlab = “Mass of Pennies (g)”, ylab = “Penny Number”, main = “Dotchart of Data in Table 4.13”)

In this plot the masses of the 100 pennies are arranged along the y-axis in the order of sampling. If we see a pattern in the data along the y-axis, such as a trend toward smaller masses as we move from the first penny to the last penny, then we have clear evidence of non-random sampling. Because our data do not show a pattern, we have more confidence in the quality of our data.

Note

Note that the dispersion of points along the x-axis is not uniform, with more points occurring near the center of the x-axis than at either end. This pattern is as expected for a normal distribution.

The last plot we will consider is a box plot, which is a useful way to identify potential outliers without making any assumptions about the data’s distribution. A box plot contains four pieces of information about a data set: the median, the middle 50% of the data, the smallest value and the largest value within a set distance of the middle 50% of the data, and possible outliers. Figure 4.28d shows the result of entering

> bwplot(penny, xlab = “Mass of Pennies (g)”, main = “Boxplot of Data in Table 4.13)”

The black dot (•) is the data set’s median. The rectangular box shows the range of masses for the middle 50% of the pennies. This also is known as the interquartile range, or IQR. The dashed lines, which are called “whiskers,” extend to the smallest value and the largest value that are within ±1.5×IQR of the rectangular box. Potential outliers are shown as open circles (º). For normally distributed data the median will be near the center of the box and the whiskers will be equidistant from the box. As is often true in statistics, the converse is not true—finding that a boxplot is perfectly symmetric does not prove that the data are normally distributed.

Note

To find the interquartile range you first find the median, which divides the data in half. The median of each half provides the limits for the box. The IQR is the median of the upper half of the data minus the median for the lower half of the data. For the data in Table 4.13 the median is 3.098. The median for the lower half of the data is 3.068 and the median for the upper half of the data is 3.115. The IQR is 3.115 – 3.068 = 0.047. You can use the command “summary(penny)” in R to obtain these values.

The lower “whisker” extends to the first data point with a mass larger than

$\mathrm{3.068 - 1.5 × IQR = 3.068 - 1.5 × 0.047 = 2.9975}$

which for this data is 2.998 g. The upper “whisker” extends to the last data point with a mass smaller than

$\mathrm{3.115+1.5×IQR = 3.115 + 1.5×0.047 = 3.1855}$

which for this data is 3.181 g.

The box plot in Figure 4.28d is consistent with the histogram (Figure 4.28a) and the kernel density plot (Figure 4.28b). Together, the three plots provide evidence that the data in Table 4.13 are normally distributed. The potential outlier, with a mass of 3.198 g, is not sufficiently far away from the upper whisker to be of concern, particularly as the size of the data set (n = 100) is so large. A Grubb’s test on the potential outlier does not provide evidence for treating it as an outlier.

Practice Exercise 4.14

Use R to create a data set consisting of 100 values from a uniform distribution by entering the command

> data = runif(100, min = 0, max = 100)

A uniform distribution is one in which every value between the minimum and the maximum is equally probable. Examine the data set by creating a histogram, a kernel density plot, a dot chart, and a box plot. Briefly comment on what the plots tell you about the your sample and its parent population.