Once collected and saved into your local file system, the raw but regular data will need to be unified into an appropriate collection of numerical arrays for further analysis.

This is the second stage of the data analysis process. It involves operations on arrays such as merging, transforming, and aggregating the data, as well as the detection and treatment of abnormal or missing values that could cause errors or hinder algorithm performance.

Data cleaning

raw data gathered from multiple sources $\longrightarrow$ clean data tables ready for analysis

This involves:

  • removing bad rows (observations): duplicates, or too many missing or abnormal values
  • removing bad columns (variables): redundant variables, too many missing or abnormal values
  • massaging column values (variables): correct type, correct scale, correct values

The end goal is to prepare

  • the right set of data tables
  • with only clean and relevant variables
  • with only clean and relevant observations

ready for analysis.

Setting the right variable types

Consider the following data on the most popular movies of all times:


In [1]:
url = 'http://www.stat.berkeley.edu/classes/s133/data/movies.txt'

!curl $url 2>/dev/null | head -4


rank|name|box|date
1|Avatar|$759.563|December 18, 2009
2|Titanic|$600.788|December 19, 1997
3|The Dark Knight|$533.184|July 18, 2008

In [2]:
%load_ext rmagic

The data tabular with separator '|', we can directly load it into a data frame without any preprocessing:


In [3]:
%%R -i url

df = read.delim(url, sep='|', header=T)
head(df)


  rank                               name      box              date
1    1                             Avatar $759.563 December 18, 2009
2    2                            Titanic $600.788 December 19, 1997
3    3                    The Dark Knight $533.184     July 18, 2008
4    4 Star Wars: Episode IV - A New Hope $460.998      May 25, 1977
5    5                            Shrek 2 $437.212      May 19, 2004
6    6         E.T. the Extra-Terrestrial $434.975     June 11, 1982

Looking at the box and date variables, we see a potential mismatch in types:

  • the dollar sign in the box column seems to indicate that the box column is represented by a character vector instead of a numeric vector
  • the date column may also be represented by a character vector instead of a vector containing date objects

Let us check the variable types of this data frame.

A data frame is a class. Since classes in R are just enhanced lists (containing the vectors representing our variables, or columns), we can use the list apply function on our data frame df in the following way:

lapply(df, class)

which will return a list containing the classes of our data frame columns.

For a better ouput, we will futher construct a data frame out of the return value of lapply:

Categorical variables: The factor class


In [4]:
%%R

modes = data.frame(lapply(df, class))
modes


     rank   name    box   date
1 integer factor factor factor

We see here the class factor, which is used by R to store categorical variables.

Factors are constructed out of regular vectors using the class constructor:


In [5]:
%%R

sex = factor(c('M','F', 'F', 'F', 'M'))

A factor object stores

  • the category values as vector of integers
  • the category names as a character vector accessible though the function

    levels(x)


In [6]:
%%R

levels(sex)


[1] "F" "M"

The print function displays the category values as strings.


In [7]:
%%R

sex


[1] M F F F M
Levels: F M

The cat function displays the category values as integers.


In [8]:
%%R

cat(sex)


2 1 1 1 2

By default, the family of read functions interprets character columns as factors.

To prevent that, one needs to set the argument stringsAsFactors to FALSE:


In [9]:
%%R

df    = read.delim(url, sep='|', header=T, stringsAsFactors=F)
head(df)


  rank                               name      box              date
1    1                             Avatar $759.563 December 18, 2009
2    2                            Titanic $600.788 December 19, 1997
3    3                    The Dark Knight $533.184     July 18, 2008
4    4 Star Wars: Episode IV - A New Hope $460.998      May 25, 1977
5    5                            Shrek 2 $437.212      May 19, 2004
6    6         E.T. the Extra-Terrestrial $434.975     June 11, 1982

Now, character columns are interpreted as character vectors, but the types of the "box" column and the "date" column are still wrong:


In [10]:
%%R
modes = data.frame(lapply(df, class))

print(modes); cat('\n\n'); head(df)


     rank      name       box      date
1 integer character character character


  rank                               name      box              date
1    1                             Avatar $759.563 December 18, 2009
2    2                            Titanic $600.788 December 19, 1997
3    3                    The Dark Knight $533.184     July 18, 2008
4    4 Star Wars: Episode IV - A New Hope $460.998      May 25, 1977
5    5                            Shrek 2 $437.212      May 19, 2004
6    6         E.T. the Extra-Terrestrial $434.975     June 11, 1982

Time variables: The date class

R has date class used to represent temporal data.

One can create a date out of a date string in using the function:

as.Date(date_string, pattern)

where pattern is a string indicating how the date in date_string is formatted using the date place holders:

%d (day number)
%m (month in decimal)
%B (month in letter)
%b (in abreviated)
%y (year: two digits)
%Y (year: four digits)

This function returns a Date object, on which we can perform numerical operations:


In [14]:
%%R

a = as.Date('December 18, 2009', '%B %d, %Y')
b = as.Date('January 29, 2013', '%B %d, %Y')

b - a


Time difference of 1138 days

We can now correct the type in our date column:


In [19]:
%%R

df$date = as.Date(df$date, '%B %d, %Y')

print(head(df)); cat('\n\n')

data.frame(lapply(df, class))


  rank                               name      box       date
1    1                             Avatar $759.563 2009-12-18
2    2                            Titanic $600.788 1997-12-19
3    3                    The Dark Knight $533.184 2008-07-18
4    4 Star Wars: Episode IV - A New Hope $460.998 1977-05-25
5    5                            Shrek 2 $437.212 2004-05-19
6    6         E.T. the Extra-Terrestrial $434.975 1982-06-11


     rank      name       box date
1 integer character character Date

Regular expressions

We still need to correct our "box" collum, since it contains character strings of the type:

$759.563

and we would like actual numerical values instead.

Unfortunately, we can not use the conversion function

as.numeric(x)

directly because of the presence of the dollar sign.

The return value would be in this case a vector of NA values (mising values).


In [20]:
%%R

head(as.numeric(df$box))


[1] NA NA NA NA NA NA

R provides a collection of function to find and replace regular expressions in character vectors:

grep(pattern, x)
sub(pattern, replacement, x)
gsub(pattern, replacement, x)

where

  • x is a character vector
  • pattern is regular expression to match and replace
  • replacement is the replacement regular expression

grep returns the indices of elements matching pattern in the character vector x, while the two other functions, sub and gsub replace the matched pattern by replacement (the difference being that sub replaces only the first occurence of the matched pattern.

R uses the set of extended regular expressions.

Groups:

 ( ) delimits a group of characters
 |   means "either the group or character on the left or on the rigth of |"

Ranges:

   .    means "any character"
 [...]  means "any character enclosed between the brackets"
 [^...] means "any character not enclosed between the brackets"


Modifiers:

 *      means "the previous character or group occurs zero or many times"
 +      means "the previous character or group occurs one or many times"
 ?      means "the previous character or group occurs zero or one time"
 {n,m}  means "the previous character or group occurs between n or m times"
 {n}    means "the previous character or group occurs exactly n times"
 \      escape special characters

Positions:

 ^ means "at the beginning of the line"
 $ means "at the end of the line"


Ressources to train your regurlar expression muscles.

In our example, we need to remove the dollar sign from the box column.

Since the dollar sign has a special meaning as a regular expression, we will have to escape it:


In [21]:
%%R
pattern     = '\\$'

replacement = ''

box_values = sub(pattern, replacement, df$box)

tail(box_values)


[1] " 57.114" " 57.059" " 57.042" " 52.823" " 56.615" " 52.581"

Now that the values has been stripped from the dollar sign, we can convert them into numbers, and replace the "box" column in our data frame:


In [23]:
%%R

df$box = as.numeric(box_values)


print(tail(df))

data.frame(lapply(df, class))


     rank                     name    box       date
995   995                Beethoven 57.114 1992-04-03
996   996                    Annie 57.059 1982-05-21
997   997                  Beaches 57.042 1988-12-21
998   998      Message in a Bottle 52.823 1999-02-12
999   999 Resident Evil: Afterlife 56.615 2010-09-10
1000 1000    Kicking and Screaming 52.581 2005-05-13
     rank      name     box date
1 integer character numeric Date

Now we can save our data into a cleaned csv file for further use.


In [24]:
%%R

write.table(df, file='movies_cleaned.table', sep='|', row.names=F)

In [ ]:
%%R

movies = read.delim('movies_cleaned.table', sep='|', header=T, stringsAsFactors=F)

movies$date = as.Date(movies$date)

head(movies)

Detecting abnormal values

One way to check if the values of a categorical variable are okay is to compute the statistic summary and look for aberrant means, medians, etc.

Let's do that with the "box" variable:


In [27]:
%%R

summary(df$box)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  52.58   70.28   93.60  117.50  134.60  759.60 

This seems okay, but still, we'd like to see how many movies are close to the max and min values.

Some erroneous outlier values may have crept in, and we may see that by plotting

  • a boxplot
  • a histogram

of the variable values to spot outliers visualy.


In [29]:
%%R -r 86 -w 400 -h 400

boxplot(df$box, main="Movie box office summary")


It seems that there are quite a bunch of outliers. We may try to check the values by

  • retrieving the movie names
  • comparing their high success with our expectations

In [31]:
%%R

limit = 400

outliers = df[df$box > limit,]

print(dim(outliers))
print(outliers)


[1] 11  4
   rank                                       name     box       date
1     1                                     Avatar 759.563 2009-12-18
2     2                                    Titanic 600.788 1997-12-19
3     3                            The Dark Knight 533.184 2008-07-18
4     4         Star Wars: Episode IV - A New Hope 460.998 1977-05-25
5     5                                    Shrek 2 437.212 2004-05-19
6     6                 E.T. the Extra-Terrestrial 434.975 1982-06-11
7     7  Star Wars: Episode I - The Phantom Menace 431.088 1999-05-19
8     8 Pirates of the Caribbean: Dead Man's Chest 423.416 2006-07-07
9     9                                Toy Story 3 414.638 2010-06-18
10   10                                 Spider-Man 407.681 2002-05-03
11   11        Transformers: Revenge of the Fallen 402.077 2009-06-24

We can also plot an histogram of the variable:


In [33]:
%%R -r 86 -w 300 -h 300

hist(df$box, main='Box office histogram')


The situation seems to correspond to what we expect: A lot of movies in the same range, and a few with enormous box office.

Adding variables

Since we have a temporal information, we maybe interested in patterns in the time variable.

We may want to add a variable for our analysis, for instance the day of the week that a movie was released. The function

weekdays(date)
months(date)

will return the corresponding day of the week for a data.

Let's use is to create another variable and add it as a factor:


In [35]:
%%R

days = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

df$weekday = factor(weekdays(df$date), levels=days)

tail(df)


     rank                     name    box       date   weekday
995   995                Beethoven 57.114 1992-04-03    Friday
996   996                    Annie 57.059 1982-05-21    Friday
997   997                  Beaches 57.042 1988-12-21 Wednesday
998   998      Message in a Bottle 52.823 1999-02-12    Friday
999   999 Resident Evil: Afterlife 56.615 2010-09-10    Friday
1000 1000    Kicking and Screaming 52.581 2005-05-13    Friday

We can now compute a frequency table for this new categorical variable, and diplay it as a barplot:


In [36]:
%%R

release_days = table(df$weekday)
release_days


   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
       10        22       161        39       750         7        11 

There seems to be a pattern emerging. Let see it with a plot:


In [37]:
%%R -r 86 -w 800 -h 300

barplot(release_days)



In [ ]: