Once collected and saved into your local file system, the raw but regular data will need to be unified into an appropriate collection of numerical arrays for further analysis.
This is the second stage of the data analysis process. It involves operations on arrays such as merging, transforming, and aggregating the data, as well as the detection and treatment of abnormal or missing values that could cause errors or hinder algorithm performance.
raw data gathered from multiple sources $\longrightarrow$ clean data tables ready for analysis
This involves:
The end goal is to prepare
ready for analysis.
Consider the following data on the most popular movies of all times:
In [1]:
url = 'http://www.stat.berkeley.edu/classes/s133/data/movies.txt'
!curl $url 2>/dev/null | head -4
In [2]:
%load_ext rmagic
The data tabular with separator '|', we can directly load it into a data frame without any preprocessing:
In [3]:
%%R -i url
df = read.delim(url, sep='|', header=T)
head(df)
Looking at the box and date variables, we see a potential mismatch in types:
Let us check the variable types of this data frame.
A data frame is a class. Since classes in R are just enhanced lists (containing the vectors representing our variables, or columns), we can use the list apply function on our data frame df
in the following way:
lapply(df, class)
which will return a list containing the classes of our data frame columns.
For a better ouput, we will futher construct a data frame out of the return value of lapply
:
In [4]:
%%R
modes = data.frame(lapply(df, class))
modes
We see here the class factor, which is used by R to store categorical variables.
Factors are constructed out of regular vectors using the class constructor:
In [5]:
%%R
sex = factor(c('M','F', 'F', 'F', 'M'))
A factor
object stores
the category names as a character vector accessible though the function
levels(x)
In [6]:
%%R
levels(sex)
The print
function displays the category values as strings.
In [7]:
%%R
sex
The cat
function displays the category values as integers.
In [8]:
%%R
cat(sex)
By default, the family of read functions interprets character columns as factors.
To prevent that, one needs to set the argument stringsAsFactors
to FALSE
:
In [9]:
%%R
df = read.delim(url, sep='|', header=T, stringsAsFactors=F)
head(df)
Now, character columns are interpreted as character vectors, but the types of the "box" column and the "date" column are still wrong:
In [10]:
%%R
modes = data.frame(lapply(df, class))
print(modes); cat('\n\n'); head(df)
R has date class used to represent temporal data.
One can create a date out of a date string in using the function:
as.Date(date_string, pattern)
where pattern
is a string indicating how the date in date_string
is formatted using the date place holders:
%d (day number)
%m (month in decimal)
%B (month in letter)
%b (in abreviated)
%y (year: two digits)
%Y (year: four digits)
This function returns a Date
object, on which we can perform numerical operations:
In [14]:
%%R
a = as.Date('December 18, 2009', '%B %d, %Y')
b = as.Date('January 29, 2013', '%B %d, %Y')
b - a
We can now correct the type in our date column:
In [19]:
%%R
df$date = as.Date(df$date, '%B %d, %Y')
print(head(df)); cat('\n\n')
data.frame(lapply(df, class))
We still need to correct our "box" collum, since it contains character strings of the type:
$759.563
and we would like actual numerical values instead.
Unfortunately, we can not use the conversion function
as.numeric(x)
directly because of the presence of the dollar sign.
The return value would be in this case a vector of NA values (mising values).
In [20]:
%%R
head(as.numeric(df$box))
R provides a collection of function to find and replace regular expressions in character vectors:
grep(pattern, x)
sub(pattern, replacement, x)
gsub(pattern, replacement, x)
where
x
is a character vectorpattern
is regular expression to match and replacereplacement
is the replacement regular expressiongrep
returns the indices of elements matching pattern
in the character vector x
, while the two other functions, sub
and gsub
replace the matched pattern by replacement
(the difference being that sub
replaces only the first occurence of the matched pattern.
R uses the set of extended regular expressions.
Groups:
( ) delimits a group of characters
| means "either the group or character on the left or on the rigth of |"
Ranges:
. means "any character"
[...] means "any character enclosed between the brackets"
[^...] means "any character not enclosed between the brackets"
Modifiers:
* means "the previous character or group occurs zero or many times"
+ means "the previous character or group occurs one or many times"
? means "the previous character or group occurs zero or one time"
{n,m} means "the previous character or group occurs between n or m times"
{n} means "the previous character or group occurs exactly n times"
\ escape special characters
Positions:
^ means "at the beginning of the line"
$ means "at the end of the line"
Ressources to train your regurlar expression muscles.
In our example, we need to remove the dollar sign from the box column.
Since the dollar sign has a special meaning as a regular expression, we will have to escape it:
In [21]:
%%R
pattern = '\\$'
replacement = ''
box_values = sub(pattern, replacement, df$box)
tail(box_values)
Now that the values has been stripped from the dollar sign, we can convert them into numbers, and replace the "box" column in our data frame:
In [23]:
%%R
df$box = as.numeric(box_values)
print(tail(df))
data.frame(lapply(df, class))
Now we can save our data into a cleaned csv file for further use.
In [24]:
%%R
write.table(df, file='movies_cleaned.table', sep='|', row.names=F)
In [ ]:
%%R
movies = read.delim('movies_cleaned.table', sep='|', header=T, stringsAsFactors=F)
movies$date = as.Date(movies$date)
head(movies)
One way to check if the values of a categorical variable are okay is to compute the statistic summary and look for aberrant means, medians, etc.
Let's do that with the "box" variable:
In [27]:
%%R
summary(df$box)
This seems okay, but still, we'd like to see how many movies are close to the max and min values.
Some erroneous outlier values may have crept in, and we may see that by plotting
of the variable values to spot outliers visualy.
In [29]:
%%R -r 86 -w 400 -h 400
boxplot(df$box, main="Movie box office summary")
It seems that there are quite a bunch of outliers. We may try to check the values by
In [31]:
%%R
limit = 400
outliers = df[df$box > limit,]
print(dim(outliers))
print(outliers)
We can also plot an histogram of the variable:
In [33]:
%%R -r 86 -w 300 -h 300
hist(df$box, main='Box office histogram')
The situation seems to correspond to what we expect: A lot of movies in the same range, and a few with enormous box office.
Since we have a temporal information, we maybe interested in patterns in the time variable.
We may want to add a variable for our analysis, for instance the day of the week that a movie was released. The function
weekdays(date)
months(date)
will return the corresponding day of the week for a data.
Let's use is to create another variable and add it as a factor:
In [35]:
%%R
days = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
df$weekday = factor(weekdays(df$date), levels=days)
tail(df)
We can now compute a frequency table for this new categorical variable, and diplay it as a barplot:
In [36]:
%%R
release_days = table(df$weekday)
release_days
There seems to be a pattern emerging. Let see it with a plot:
In [37]:
%%R -r 86 -w 800 -h 300
barplot(release_days)
In [ ]: