Data types in a frame

A categorical variable is similar to what other languages sometimes call an "enum." R calls these "factors" and stores them efficiently as integer codes paired with a set of labels (the "levels"). Before using code designed for categorical variables, we need to tell R that a column of a data frame should be treated as a "factor."

Here is a column of strings:


In [60]:
stringcolors = c("red","red","blue","green","red")

In [61]:
stringcolors


Out[61]:
  1. 'red'
  2. 'red'
  3. 'blue'
  4. 'green'
  5. 'red'

Let's convert it to factors.


In [62]:
colors = factor(stringcolors)

In [63]:
colors


Out[63]:
  1. red
  2. red
  3. blue
  4. green
  5. red

In [64]:
is.factor(colors)


Out[64]:
TRUE
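
To make the integer representation concrete, here is a small sketch using base R's levels() and as.integer(). It rebuilds the same vector so it can run on its own; the comments describe what each call should return.

```r
stringcolors = c("red", "red", "blue", "green", "red")
colors = factor(stringcolors)

levels(colors)      # the distinct categories: "blue" "green" "red"
as.integer(colors)  # the code stored for each element: 3 3 1 2 3

# Each code is an index into levels(), so this recovers the original strings.
levels(colors)[as.integer(colors)]
```

Each element is stored as the position of its label in levels(colors), which is why a long column with few distinct values is cheap to store.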

In [65]:
sizes = factor(c("S","M","M","L","S"))

In [66]:
prices = factor(c(19.99, 12.99, 9.99,12.99,19.99))

In [67]:
prices


Out[67]:
  1. 19.99
  2. 12.99
  3. 9.99
  4. 12.99
  5. 19.99
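
Turning prices into a factor treats each distinct price as a category, which suits tabulation but not arithmetic. If the numbers are needed again later, one common approach is sketched below; note that as.numeric() applied directly to a factor returns the level codes, not the prices.

```r
prices = factor(c(19.99, 12.99, 9.99, 12.99, 19.99))

as.numeric(prices)   # level codes, not prices: 3 2 1 2 3

# Go through the labels to get the numeric values back.
original = as.numeric(as.character(prices))
original             # 19.99 12.99 9.99 12.99 19.99
```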

In [68]:
clothes = data.frame(colors,sizes,prices)

In [69]:
clothes


Out[69]:
  colors sizes prices
1    red     S  19.99
2    red     M  12.99
3   blue     M   9.99
4  green     L  12.99
5    red     S  19.99

In [70]:
summary(clothes)


Out[70]:
   colors  sizes   prices 
 blue :1   L:1   9.99 :1  
 green:1   M:2   12.99:2  
 red  :3   S:2   19.99:2  

In [71]:
xtabs(~colors+sizes, data=clothes)


Out[71]:
       sizes
colors  L M S
  blue  0 1 0
  green 1 0 0
  red   0 1 2
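
The column order of this table follows the levels of the factor. A quick check with levels() and is.ordered(), as a self-contained sketch:

```r
sizes = factor(c("S", "M", "M", "L", "S"))

levels(sizes)      # "L" "M" "S": sorted alphabetically by default
is.ordered(sizes)  # FALSE: a plain factor carries no ordering
```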

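But the sizes are in the wrong order: by default R sorts the levels alphabetically (L, M, S), and it has no way to know the order we intend. To tell R to treat a variable as ordinal, with the levels in a given order, we do this: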


In [72]:
sizes = factor(sizes,levels=c("S","M","L"), ordered=TRUE)

In [74]:
clothes = data.frame(colors,sizes,prices)

In [75]:
xtabs(~colors+sizes, data=clothes)


Out[75]:
       sizes
colors  S M L
  blue  0 1 0
  green 0 0 1
  red   2 1 0

Now the sizes are ordered S < M < L.
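
An ordered factor also supports comparisons, which a plain factor does not. The sketch below rebuilds sizes so it runs on its own; the particular checks are only illustrations.

```r
sizes = factor(c("S", "M", "M", "L", "S"),
               levels = c("S", "M", "L"), ordered = TRUE)

is.ordered(sizes)  # TRUE
sizes >= "M"       # FALSE TRUE TRUE TRUE FALSE
max(sizes)         # L, the largest level present
table(sizes)       # counts reported in level order S, M, L
```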

