2.2: Using R to Organize and Manipulate Data
The data in Table \(\PageIndex{1}\) should remind you of a data frame, a way of organizing data in R that we introduced in Chapter 1. Here we will learn how to create a data frame that holds the data in Table \(\PageIndex{1}\) and how we can make use of the data frame.
Creating a Data Frame
To create a data frame we begin by creating vectors for each of the variables. Note that letters is a constant in R that contains the 26 lowercase letters of the Roman alphabet; here we are using just the first six letters for the bag ids.
bag_id = letters[1:6]
year = c(2006, 2006, 2000, 2000, 1994, 1994)
weight = c(1.74, 1.74, 0.80, 0.80, 10.0, 10.0)
type = c("peanut", "peanut", "plain", "plain", "plain", "plain")
number_yellow = c(2, 3, 1, 5, 56, 63)
percent_red = c(27.8, 4.35, 22.7, 20.8, 23.0, 21.9)
total = c(18, 23, 22, 24, 331, 333)
rank = c("sixth", "fourth", "fifth", "third", "second", "first")
To create the data frame, we use R’s data.frame() function, passing to it the names of our vectors, each of which must be of the same length. There is an option within this function to treat variables whose values are character strings as factors (another name for categorical variables) by using the argument stringsAsFactors = TRUE. As the default value for this argument depends on your version of R (it was TRUE before R 4.0.0 and is FALSE from that version on), it is useful to make your choice explicit by including it in your code, as we do here.
mm_data = data.frame(bag_id, year, weight, type, number_yellow, percent_red, total, rank, stringsAsFactors = TRUE)
mm_data
bag_id year weight type number_yellow percent_red total rank
1 a 2006 1.74 peanut 2 27.80 18 sixth
2 b 2006 1.74 peanut 3 4.35 23 fourth
3 c 2000 0.80 plain 1 22.70 22 fifth
4 d 2000 0.80 plain 5 20.80 24 third
5 e 1994 10.00 plain 56 23.00 331 second
6 f 1994 10.00 plain 63 21.90 333 first
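To see what the stringsAsFactors argument changes, the following minimal sketch builds a two-column data frame both ways and compares the class of the type column. The object names as_factors and as_strings are illustrative only and do not appear elsewhere in this chapter.

```r
# Two of the vectors used in the text, repeated so the example stands alone
bag_id = letters[1:6]
type = c("peanut", "peanut", "plain", "plain", "plain", "plain")

# Explicitly request factors, then explicitly request character strings
as_factors = data.frame(bag_id, type, stringsAsFactors = TRUE)
as_strings = data.frame(bag_id, type, stringsAsFactors = FALSE)

class(as_factors$type)   # "factor"
class(as_strings$type)   # "character"

# Quick checks on the shape of the result
nrow(as_factors)         # 6, one row per bag
names(as_factors)        # "bag_id" "type"
```

Stating the argument explicitly gives the same result regardless of which version of R runs the code.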
If we examine the structure of this data set using R’s str() function, we see that bag_id, type, and rank are factors and that year, weight, number_yellow, percent_red, and total are numerical variables, assignments that are consistent with our earlier analysis of the data.
str(mm_data)
'data.frame': 6 obs. of 8 variables:
$ bag_id : Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6
$ year : num 2006 2006 2000 2000 1994 ...
$ weight : num 1.74 1.74 0.8 0.8 10 10
$ type : Factor w/ 2 levels "peanut","plain": 1 1 2 2 2 2
$ number_yellow: num 2 3 1 5 56 63
$ percent_red : num 27.8 4.35 22.7 20.8 23 21.9
$ total : num 18 23 22 24 331 333
$ rank : Factor w/ 6 levels "fifth","first",..: 5 3 1 6 4 2
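The integer codes that str() reports for a factor refer to positions in its list of levels, which R sorts alphabetically by default. The sketch below, using the rank vector from above, shows where the codes 5 3 1 6 4 2 come from. The second half, which sets the levels by hand with the levels argument of factor(), is an addition for illustration and is not part of the original example; the names rank_factor, rank_ordered, and rank_values are hypothetical.

```r
# The rank values from the text, stored under a new name for this example
rank_values = c("sixth", "fourth", "fifth", "third", "second", "first")

# Default behavior: levels are the unique values in alphabetical order
rank_factor = factor(rank_values)
levels(rank_factor)       # "fifth" "first" "fourth" "second" "sixth" "third"
as.integer(rank_factor)   # 5 3 1 6 4 2

# Illustration only: supply the levels in their natural order
rank_ordered = factor(rank_values,
                      levels = c("first", "second", "third",
                                 "fourth", "fifth", "sixth"))
as.integer(rank_ordered)  # 6 4 5 3 2 1
```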
Finally, we can use the function as.factor() to have R treat a numerical variable as a categorical variable, as we do here for year. Why we might wish to do this is a topic we will return to in later chapters.
mm_year_as_factor = data.frame(bag_id, as.factor(year), percent_red, total, stringsAsFactors = TRUE)
str(mm_year_as_factor)
'data.frame': 6 obs. of 4 variables:
$ bag_id : Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6
$ as.factor.year.: Factor w/ 3 levels "1994","2000",..: 3 3 2 2 1 1
$ percent_red : num 27.8 4.35 22.7 20.8 23 21.9
$ total : num 18 23 22 24 331 333
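Note that R builds the column name as.factor.year. from the expression passed to data.frame(). One way to keep a cleaner name, sketched here as an option rather than as the approach used in this chapter, is to name the column when creating the data frame. The object name mm_named is hypothetical.

```r
# Vectors from the text, repeated so the example stands alone
bag_id = letters[1:6]
year = c(2006, 2006, 2000, 2000, 1994, 1994)
percent_red = c(27.8, 4.35, 22.7, 20.8, 23.0, 21.9)

# Naming the argument (year = ...) sets the column name directly
mm_named = data.frame(bag_id, year = as.factor(year), percent_red,
                      stringsAsFactors = TRUE)

names(mm_named)         # "bag_id" "year" "percent_red"
levels(mm_named$year)   # "1994" "2000" "2006"
```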
Creating a New Data Frame by Subsetting an Existing Data Frame
In Chapter 1.2 we learned how to retrieve individual rows or columns from a data frame and assign them to a new object. Here we learn how to use R’s more flexible subset() function to accomplish the same thing. For example, the following code retrieves only the data for plain M&Ms.
plain_mm = subset(mm_data, type == "plain")
plain_mm
bag_id year weight type number_yellow percent_red total rank
3 c 2000 0.8 plain 1 22.7 22 fifth
4 d 2000 0.8 plain 5 20.8 24 third
5 e 1994 10.0 plain 56 23.0 331 second
6 f 1994 10.0 plain 63 21.9 333 first
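For comparison, the same rows can be retrieved with bracket indexing and a logical condition. The sketch below uses a reduced, hypothetical data frame named mm_small so that it runs on its own, and checks that the two approaches agree.

```r
# A reduced version of the data frame, for illustration only
mm_small = data.frame(
  bag_id = letters[1:6],
  type = c("peanut", "peanut", "plain", "plain", "plain", "plain"),
  total = c(18, 23, 22, 24, 331, 333),
  stringsAsFactors = TRUE
)

# subset() lets us refer to the column by its bare name
plain_subset = subset(mm_small, type == "plain")

# Bracket indexing needs the data frame's name in front of the column
plain_bracket = mm_small[mm_small$type == "plain", ]

# Compare the two results
all.equal(plain_subset, plain_bracket)
```

One difference worth knowing: subset() drops rows for which the condition evaluates to NA, whereas bracket indexing returns a row of NA values for each of them. With no missing values, as here, the two give the same rows.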
Note that type == "plain" uses a relational operator to choose only those rows in which the variable type has the value plain. Here is a list of relational operators:
operator | usage | meaning |
---|---|---|
< | x < y | x is less than y |
> | x > y | x is greater than y |
<= | x <= y | x is less than or equal to y |
>= | x >= y | x is greater than or equal to y |
== | x == y | x is exactly equal to y |
!= | x != y | x is not equal to y |
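Applied to a vector, each relational operator returns a logical vector with one TRUE or FALSE per element; subset() keeps the rows where the result is TRUE. A brief sketch using two of the vectors defined earlier:

```r
# Vectors from the text, repeated so the example stands alone
weight = c(1.74, 1.74, 0.80, 0.80, 10.0, 10.0)
type = c("peanut", "peanut", "plain", "plain", "plain", "plain")

weight > 1            # TRUE TRUE FALSE FALSE TRUE TRUE
weight <= 0.80        # FALSE FALSE TRUE TRUE FALSE FALSE
type != "plain"       # TRUE TRUE FALSE FALSE FALSE FALSE

# Summing a logical vector counts the TRUE values
sum(type == "plain")  # 4
```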
We can combine two or more conditions using the logical & (and) operator, which keeps only those rows that satisfy every condition.
mm_plain10 = subset(mm_data, (weight == 10.0 & type == "plain"))
mm_plain10
bag_id year weight type number_yellow percent_red total rank
5 e 1994 10 plain 56 23.0 331 second
6 f 1994 10 plain 63 21.9 333 first
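The & operator works element by element on two logical vectors. The sketch below separates the two conditions used above so that each step is visible. The | (or) operator shown at the end is not used in this section and is included only for contrast; the names is_heavy and is_plain are hypothetical.

```r
# Vectors from the text, repeated so the example stands alone
weight = c(1.74, 1.74, 0.80, 0.80, 10.0, 10.0)
type = c("peanut", "peanut", "plain", "plain", "plain", "plain")

is_heavy = weight == 10.0   # FALSE FALSE FALSE FALSE TRUE TRUE
is_plain = type == "plain"  # FALSE FALSE TRUE TRUE TRUE TRUE

# TRUE only where both conditions are TRUE
is_heavy & is_plain         # FALSE FALSE FALSE FALSE TRUE TRUE

# For contrast: TRUE where at least one condition is TRUE
is_heavy | is_plain         # FALSE FALSE TRUE TRUE TRUE TRUE
```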
We can also narrow the number of variables returned using the subset() function’s select argument. In this example we exclude samples collected before the year 2000 and return only the year, the number of yellow M&Ms, and the percentage of red M&Ms.
mm_20xx = subset(mm_data, year >= 2000, select = c(year, number_yellow, percent_red))
mm_20xx
year number_yellow percent_red
1 2006 2 27.80
2 2006 3 4.35
3 2000 1 22.70
4 2000 5 20.80
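The select argument also accepts a minus sign to drop columns and a colon to take a range of adjacent columns, forms that appear in the examples of R's help page for subset(). The sketch below uses a reduced, hypothetical data frame named mm_small so that it runs on its own.

```r
# A reduced version of the data frame, for illustration only
mm_small = data.frame(
  bag_id = letters[1:6],
  year = c(2006, 2006, 2000, 2000, 1994, 1994),
  number_yellow = c(2, 3, 1, 5, 56, 63),
  percent_red = c(27.8, 4.35, 22.7, 20.8, 23.0, 21.9),
  stringsAsFactors = TRUE
)

# Drop a column by prefixing its name with a minus sign
no_id = subset(mm_small, year >= 2000, select = -bag_id)
names(no_id)      # "year" "number_yellow" "percent_red"
nrow(no_id)       # 4

# Keep a range of adjacent columns with a colon
first_two = subset(mm_small, year >= 2000, select = bag_id:year)
names(first_two)  # "bag_id" "year"
```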