Note: The tutorial series works alot with tidyr, dplyr, ggplot and similar packages. All of these libraries provide a concise and clear way to execute data reshaping and visualization in R beyond the vanilla approach.
If you need information about a function or package simply type ?function_name
tidyr: helps tidying data frames
dplyr: is a grammer for data manipulation
ggplot2: is a grammar for graphics to declaratively declare plots
In R are you declare the usage of a package via the library-function.
In [167]:
library(assertthat)
library(tidyr)
library(dplyr, warn.conflicts = F)
library(ggplot2)
Before using a package it is always usful to look at the vignette or the documentation. In general, if you want to know something about a function or package use a question mark.
In [168]:
?dplyr
Or double question mark to search for a specific term within the documentation system.
In [169]:
??dplyr
In [170]:
df <- read.csv('datasets/student-mat.csv', sep = ';') %>%
tbl_df
df %>% head
Tidy data describes a set of principles to organize data such that the follow up analysis is simplified. It is focused around data frames in which columns describe the variable and rows the observations. A variable (a column) could be the name of a person and every entry within the column describes a specific name e.g., John Doe.
This is similar to databases where tidy data would be a table in Codd's 3rd normal form but reframed in a statistical fashion:
The five most common problems with data and their respective data sets are:
Hadley provides an example for all these issues on the link above, so browse through.
In [171]:
df %>% head
Immediate questions arise for example:
In [172]:
df <- df %>%
rename(Sex = sex,
Age = age,
School = school,
HomeArea = address,
ParentStatus = Pstatus,
EducationMother = Medu,
JobMother = Mjob,
EducationFather = Fedu,
JobFather = Fjob,
Guardian = guardian,
FamilySize = famsize,
FamilyRelationship = famrel,
SchoolChoiceReason = reason,
TravelTime = traveltime,
StudyTime = studytime,
ClassFailed = failures,
EducationalSchoolSupport = schoolsup,
EducationalFamilySupport = famsup,
ExtraCurricularActivities = activities,
ExtraPaidClass = paid,
InternetAccess = internet,
AttendedNurserySchool = nursery,
TargetsHigherEducation = higher,
RelationshipStatus = romantic,
LeisureTime = freetime,
SocialInteractionIntensity = goout,
AlcoholConsumptionWeekend = Walc,
AlcoholConsumptionWorkday = Dalc,
HealthStatus = health,
SchoolAbsences = absences,
FirstPeriodGrade = G1,
SecondPeriodGrade = G2,
FinalGrade = G3)
df %>% head
Now that we actually know what the columns mean we can start to think about their values.
For instance
In [173]:
RecodeEducation <- function(x) recode(x, `0` = 'None', `1` = 'Primary', `2` = 'PrimaryExtended', `3` = 'SecondaryExtended', `4` = 'Higher')
RecodeJob <- function(x) recode(x, teacher = 'Education', services = 'Services', at_home = 'Home', other = 'Other', health = 'Health')
RecodeBinary <- function(x) recode(x, yes = 'Yes', no = 'No')
RecodeLikert <- function(x) recode(x, `1` = 'VeryLow', `2` = 'Low', `3` = 'Medium', `4` = 'High', `5` = 'VeryHigh')
df <- df %>%
mutate(Sex = recode(Sex, F = 'Female', M = 'Male'),
School = recode(School, GP = 'GabrielPereira', MS = 'MousinhoDaSilveira'),
HomeArea = recode(HomeArea, U = 'Urban', R = 'Rural'),
ParentStatus = recode(ParentStatus, T = 'Together', A = 'Apart'),
EducationMother = RecodeEducation(EducationMother),
JobMother = RecodeJob(JobMother),
EducationFather = RecodeEducation(EducationFather),
JobFather = RecodeJob(JobFather),
Guardian = recode(Guardian, mother = 'Mother', father = 'Father', other = 'Other'),
FamilySize = recode(FamilySize, GT3 = 'Large', LE3 = 'Small'),
FamilyRelationship = recode(FamilyRelationship, `1` = 'VeryBad', `2` = 'Bad', `3` = 'Ok', `4` = 'Good', `5` = 'VeryGood'),
SchoolChoiceReason = recode(SchoolChoiceReason, course = 'CoursePreference', other = 'Other', home = 'HomeProximity', reputation = 'Reputation'),
TravelTime = recode(TravelTime, `1` = 'x < 15', `2` = '15 <= x < 30', `3` = '30 <= x < 60', `4` = 'x >= 60'),
StudyTime = recode(StudyTime, `1` = 'x < 120', `2` = '120 <= x < 300', `3` = '300 <= x < 600', `4` = 'x >= 600')) %>%
mutate_at(vars(EducationMother, EducationFather), .funs = RecodeEducation) %>%
mutate_at(vars(JobMother, JobFather), .funs = RecodeJob) %>%
mutate_at(vars(EducationalFamilySupport,
EducationalSchoolSupport,
ExtraCurricularActivities,
ExtraPaidClass,
InternetAccess,
AttendedNurserySchool,
TargetsHigherEducation,
RelationshipStatus),
.funs = RecodeBinary) %>%
mutate_at(vars(LeisureTime,
SocialInteractionIntensity,
AlcoholConsumptionWeekend,
AlcoholConsumptionWorkday,
HealthStatus),
.funs = RecodeLikert)
In [174]:
df %>% head
This is much more readable and understandable and we can make quick sanity checks. For instance, now we actually know that education is categorical data but on a different scale as family relationship and so forth.
Next we want to check the basic characteristics the data frame in order:
In [175]:
summary(df)
Typical task are:
How can we add an identifier for each observation?
In [176]:
# [Windowing, Selecting]
df <- df %>%
mutate(Id = row_number()) %>%
select(Id, everything())
df %>% head
After cleaning the data, we noticed that G1-G3 are just different types of grade, thus values that should be contained within a column. We now might want to consolidate the three columns into on categorical column describing the type of grade (Frist, Second, Final) and one column that actually contains the mark itself.
In [177]:
# [Reshaping]
# use tidyr to collect multiple columns into two columns
df <- df %>%
gather(key = GradeName,
value = Grade,
FirstPeriodGrade, SecondPeriodGrade, FinalGrade)
df %>% head
One way to handle this is to organize the facts into named vectors.
In [178]:
# portuguese marks
PORTUGUESE_MARKS <- c(worst = 0, 1:19, best = 20)
PORTUGUESE_MARKS
In [179]:
# austrian marks
AUSTRIAN_MARKS <- c(best = 1, 2:4, worst = 5)
AUSTRIAN_MARKS
To solve the problem we need to rescale and invert the portuguese grades such that they map between 1 to 5.
We can use feature scaling to map values from one scale to another scale given by
$$ FeatureScaling(mark) = oldMin + \dfrac{(mark - oldMin) \cdot (newMax - newMin)}{(oldMax - oldMin)}, $$where oldX would describe the portuguese minimum and maximum value of the scale and newX would describe the austrian minimum and maximum.
The scale is then inverted by $$ InvertScale(mark) = newMax + 1 - mark. $$
The biggest advantage is that we can vectorize these computation on either the entire data frame or subsets of it.
In [180]:
FeatureScaling <- function(x, oldMax, oldMin, newMax, newMin){
newMin + ((x - oldMin) * (newMax - newMin) / (oldMax - oldMin))
}
InvertScale <- function(x, max){
max + 1 - x
}
In [181]:
# [Windowing]
gradeAustrian_df <- df %>%
mutate(GradeAustrian = FeatureScaling(Grade,
oldMax = max(PORTUGUESE_MARKS),
oldMin = min(PORTUGUESE_MARKS),
newMax = max(AUSTRIAN_MARKS),
newMin = min(AUSTRIAN_MARKS)),
GradeAustrian = InvertScale(GradeAustrian,
max = max(AUSTRIAN_MARKS)))
gradeAustrian_df %>%
head
In [182]:
# [Selecting, Filtering, Distinct, Checking]
gradeAustrian_df %>%
select(Id, Grade, GradeAustrian) %>%
filter(Grade == 0 | Grade == 10 | Grade == 20) %>%
distinct(Grade, .keep_all=TRUE)
One way to make automatic lightweight checks in your scripts is via assertions.
In [183]:
# [Checking]
assert_that(
gradeAustrian_df %>%
filter(GradeAustrian > 5 & GradeAustrian < 1) %>%
nrow()
== 0
)
What is average grade of a student?
The data frame contains now three rows per student since there are three different grade that we want to summarise. Nevertheless we want to apply the mean function only to the three rows associated with a specific student - time for grouping.
In [184]:
# [Grouping, Summarise]
gradeMean_df <- df %>%
group_by(Id) %>%
summarise(GradeMean = mean(Grade))
gradeMean_df %>%
head
In [185]:
# [Joining]
# dplyr uses automatically matching columns to join on
# df %>%
# inner_join(gradeMean_df)
# or if it is only one column simply defined the column
# df %>%
# inner_join(gradeMean_df, by = 'Id')
# but best define the mapping to avoid mistakes
df %>%
inner_join(gradeMean_df, by = c('Id' = 'Id')) %>%
head
References
Data
Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance.
Data Wrangling
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23. doi:http://dx.doi.org/10.18637/jss.v059.i10