Data Cleaning and Wrangling

Loading, cleaning, tidying and reshaping data with R.

Note: This tutorial series works a lot with tidyr, dplyr, ggplot2 and similar packages. These libraries provide a concise and clear way to reshape and visualize data in R, beyond what base R offers.

If you need information about a function or package, simply type ?function_name.

tidyr: helps to tidy data frames

dplyr: a grammar of data manipulation

ggplot2: a grammar of graphics for building plots declaratively

In R, you load a package with the library function.


In [167]:
library(assertthat)
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

Before using a package, it is always useful to look at its vignettes or documentation. In general, if you want to know something about a function or package, use a question mark.


In [168]:
?dplyr

Or use a double question mark to search the documentation system for a specific term.


In [169]:
??dplyr


Import Data

R can import data from CSV and JSON files, and also from many other sources such as HDF5 files or databases. Start by loading your data into a minimal set of logically related data frames and convert them into dplyr tables (tibbles).


In [170]:
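# stringsAsFactors = TRUE keeps the factor columns shown below; since R 4.0.0 the default is FALSE
df <- read.csv('datasets/student-mat.csv', sep = ';', stringsAsFactors = TRUE) %>%
    as_tibble()   # replaces tbl_df, which is deprecated in current dplyr
df %>% head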


school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other 5 4 2 1 2 5 10 15 15 15

Conclusion on Importing

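  • R can access various data sources - search online for a package that handles your specific data source.
  • Load the data into a minimal set of data frames.
  • Before R 4.0.0, read.csv converted strings to factors by default; since then the default is stringsAsFactors = FALSE - be aware of which behaviour you get.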
  • Encapsulate your loading logic in its own script to avoid change propagation.
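
The last two points can be sketched together. The function name LoadStudents, its strings_as_factors argument and the three-line sample file are made up for illustration; the idea is only that every import decision, including the string-to-factor choice, lives in one place.

```r
# Hypothetical loader: all import decisions are made here and nowhere else.
LoadStudents <- function(path, strings_as_factors = FALSE) {
    read.csv(path, sep = ';', stringsAsFactors = strings_as_factors)
}

# Made-up sample file in the same semicolon-separated layout.
path <- tempfile(fileext = '.csv')
writeLines(c('school;sex;age', 'GP;F;18', 'MS;M;17'), path)

as_text    <- LoadStudents(path)
as_factors <- LoadStudents(path, strings_as_factors = TRUE)

class(as_text$school)      # "character"
class(as_factors$school)   # "factor"
```

Scripts further down the pipeline then only call the loader and never repeat the read.csv arguments.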

Tidy Data

Tidy data is a set of principles for organizing data so that the follow-up analysis is simplified. It centers on data frames in which columns hold variables and rows hold observations. A variable (a column) could be the name of a person, and every entry in that column is one specific name, e.g., John Doe.

This is similar to databases where tidy data would be a table in Codd's 3rd normal form but reframed in a statistical fashion:

  1. Each variable forms a column,
  2. each observation forms a row,
  3. each type of observational unit forms a table.

The five most common problems with messy data sets are:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

Wickham (2014) gives an example for each of these problems in the Tidy Data paper listed in the references, so browse through it.
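
As a small illustration of the second problem, the sketch below splits a column that holds two variables. The data frame and the column name SexAge are made up; separate is the tidyr function used here, and newer tidyr versions also offer separate_wider_delim for the same job.

```r
library(tidyr)

# Made-up data: sex and age are packed into a single column.
messy <- data.frame(Id = 1:3,
                    SexAge = c('F_18', 'M_17', 'F_15'),
                    stringsAsFactors = FALSE)

# Split on the underscore; convert = TRUE turns the age part into integers.
tidy <- separate(messy, SexAge, into = c('Sex', 'Age'), sep = '_', convert = TRUE)
tidy
```

After the split, each variable has its own column and can be filtered or summarised on its own.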

Conventions

In addition to these common tidy principles it is highly recommended to use common software development best practices for naming variables but also for their respective values.

Data Cleaning

After loading, it is advisable to clean the data, i.e., define proper column names and recode the values into something sensible. Take for example the student data we loaded before:


In [171]:
df %>% head


school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other 5 4 2 1 2 5 10 15 15 15

Immediate questions arise, for example:

  • What do Medu and Fedu mean?
  • What are the G1-G3 columns?
  • etc.

Rename Variable

So let's first give the columns long, descriptive names. We can do this via the dplyr rename function, which takes pairs of the form new_name = old_name.


In [172]:
df <- df %>%
    rename(Sex = sex,
           Age = age,
           School = school,
           HomeArea = address,
           ParentStatus = Pstatus,
           EducationMother = Medu,
           JobMother = Mjob,
           EducationFather = Fedu,
           JobFather = Fjob,
           Guardian = guardian,
           FamilySize = famsize,
           FamilyRelationship = famrel,
           SchoolChoiceReason = reason,
           TravelTime = traveltime,
           StudyTime = studytime,
           ClassFailed = failures,
           EducationalSchoolSupport = schoolsup,
           EducationalFamilySupport = famsup,
           ExtraCurricularActivities = activities,
           ExtraPaidClass = paid,
           InternetAccess = internet,
           AttendedNurserySchool = nursery,
           TargetsHigherEducation = higher,
           RelationshipStatus = romantic,
           LeisureTime = freetime,
           SocialInteractionIntensity = goout,
           AlcoholConsumptionWeekend = Walc,
           AlcoholConsumptionWorkday = Dalc,
           HealthStatus = health,
           SchoolAbsences = absences,
           FirstPeriodGrade = G1,
           SecondPeriodGrade = G2,
           FinalGrade = G3) 

df %>% head


School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother JobFather FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences FirstPeriodGrade SecondPeriodGrade FinalGrade
GP F 18 U GT3 A 4 4 at_home teacher 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other 5 4 2 1 2 5 10 15 15 15

Now that we actually know what the columns mean we can start to think about their values.

For instance:

  • What does GP stand for? Is it a school type?
  • What are the numbers in education? Years?
  • etc.

Recode Values

So let's recode the levels of the categorical data into values that make sense.


In [173]:
RecodeEducation <- function(x) recode(x, `0` = 'None', `1` = 'Primary', `2` = 'PrimaryExtended', `3` = 'SecondaryExtended', `4` = 'Higher') 
RecodeJob <- function(x) recode(x, teacher = 'Education', services = 'Services', at_home = 'Home', other = 'Other', health = 'Health')
RecodeBinary <- function(x) recode(x, yes = 'Yes', no = 'No')
RecodeLikert <- function(x) recode(x, `1` = 'VeryLow', `2` = 'Low', `3` = 'Medium', `4` = 'High', `5` = 'VeryHigh')
    
df <- df %>%
    mutate(Sex = recode(Sex, F = 'Female', M = 'Male'),
           School = recode(School, GP = 'GabrielPereira', MS = 'MousinhoDaSilveira'),
           HomeArea = recode(HomeArea, U = 'Urban', R = 'Rural'),
           ParentStatus = recode(ParentStatus, T = 'Together', A = 'Apart'),
           # the education and job columns are recoded once, via mutate_at below
           Guardian = recode(Guardian, mother = 'Mother', father = 'Father', other = 'Other'),
           FamilySize = recode(FamilySize, GT3 = 'Large', LE3 = 'Small'),
           FamilyRelationship = recode(FamilyRelationship, `1` = 'VeryBad', `2` = 'Bad', `3` = 'Ok', `4` = 'Good', `5` = 'VeryGood'),
           SchoolChoiceReason = recode(SchoolChoiceReason, course = 'CoursePreference', other = 'Other', home = 'HomeProximity', reputation = 'Reputation'),
           TravelTime = recode(TravelTime, `1` = 'x < 15', `2` = '15 <= x < 30', `3` = '30 <= x < 60', `4` = 'x >= 60'),
           StudyTime = recode(StudyTime, `1` = 'x < 120', `2` = '120 <= x < 300', `3` = '300 <= x < 600', `4` = 'x >= 600')) %>%
    mutate_at(vars(EducationMother, EducationFather), .funs = RecodeEducation) %>%
    mutate_at(vars(JobMother, JobFather), .funs = RecodeJob) %>%
    mutate_at(vars(EducationalFamilySupport, 
                   EducationalSchoolSupport, 
                   ExtraCurricularActivities, 
                   ExtraPaidClass, 
                   InternetAccess, 
                   AttendedNurserySchool, 
                   TargetsHigherEducation, 
                   RelationshipStatus),
              .funs = RecodeBinary) %>%
    mutate_at(vars(LeisureTime,
                   SocialInteractionIntensity,
                   AlcoholConsumptionWeekend,
                   AlcoholConsumptionWorkday,
                   HealthStatus),
              .funs = RecodeLikert)

In [174]:
df %>% head


School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother JobFather FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences FirstPeriodGrade SecondPeriodGrade FinalGrade
GabrielPereira Female 18 Urban Large Apart Higher Higher Home Education Good Medium High VeryLow VeryLow Medium 6 5 6 6
GabrielPereira Female 17 Urban Large Together Primary Primary Home Other VeryGood Medium Medium VeryLow VeryLow Medium 4 5 5 6
GabrielPereira Female 15 Urban Small Together Primary Primary Home Other Good Medium Low Low Medium Medium 10 7 8 10
GabrielPereira Female 15 Urban Large Together Higher PrimaryExtended Health Services Ok Low Low VeryLow VeryLow VeryHigh 2 15 14 15
GabrielPereira Female 16 Urban Large Together SecondaryExtended SecondaryExtended Other Other Good Medium Low VeryLow Low VeryHigh 4 6 10 10
GabrielPereira Male 16 Urban Small Together Higher SecondaryExtended Services Other VeryGood High Low VeryLow Low VeryHigh 10 15 15 15

This is much more readable and understandable, and we can make quick sanity checks. For instance, we now know that education is categorical data, but on a different scale than family relationship, and so forth.
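
Because the recoded education levels have a natural order, one option is to store them as an ordered factor so that sorting and comparisons respect that order. The sketch below uses the level names from RecodeEducation; the four sample values and the constant name EDUCATION_LEVELS are made up for illustration.

```r
# Level names as produced by RecodeEducation, from lowest to highest.
EDUCATION_LEVELS <- c('None', 'Primary', 'PrimaryExtended',
                      'SecondaryExtended', 'Higher')

# Made-up sample of recoded values.
education <- c('Higher', 'Primary', 'SecondaryExtended', 'Primary')

# An ordered factor knows that Primary < SecondaryExtended < Higher.
education <- factor(education, levels = EDUCATION_LEVELS, ordered = TRUE)

below_higher <- education < 'Higher'   # element-wise comparison against a level
sorted       <- sort(education)        # sorts by level order, not alphabetically
```

A plain character column would sort alphabetically instead, which puts 'Higher' before 'None'.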

Inspect data

Next we want to check the basic characteristics of the data frame:

  • Are the variable names correct?
  • Is the basic distribution of your categorical data as expected, or did you omit, reverse or lose values?
  • Is the basic distribution of your interval data as expected, or are the maxima, means or quantiles off?
  • Are all variables of the correct type (factor or character)?
  • Are all variables really variables, or are they actually values?
  • Do you have a unique identifier for each observation?

In [175]:
summary(df)


                School        Sex           Age        HomeArea   FamilySize 
 GabrielPereira    :349   Female:208   Min.   :15.0   Rural: 88   Large:281  
 MousinhoDaSilveira: 46   Male  :187   1st Qu.:16.0   Urban:307   Small:114  
                                       Median :17.0                          
                                       Mean   :16.7                          
                                       3rd Qu.:18.0                          
                                       Max.   :22.0                          
   ParentStatus EducationMother    EducationFather        JobMother  
 Apart   : 41   Length:395         Length:395         Home     : 59  
 Together:354   Class :character   Class :character   Health   : 34  
                Mode  :character   Mode  :character   Other    :141  
                                                      Services :103  
                                                      Education: 58  
                                                                     
     JobFather          SchoolChoiceReason   Guardian    TravelTime       
 Home     : 20   CoursePreference:145      Father: 90   Length:395        
 Health   : 18   HomeProximity   :109      Mother:273   Class :character  
 Other    :217   Other           : 36      Other : 32   Mode  :character  
 Services :111   Reputation      :105                                     
 Education: 29                                                            
                                                                          
  StudyTime          ClassFailed     EducationalSchoolSupport
 Length:395         Min.   :0.0000   No :344                 
 Class :character   1st Qu.:0.0000   Yes: 51                 
 Mode  :character   Median :0.0000                           
                    Mean   :0.3342                           
                    3rd Qu.:0.0000                           
                    Max.   :3.0000                           
 EducationalFamilySupport ExtraPaidClass ExtraCurricularActivities
 No :153                  No :214        No :194                  
 Yes:242                  Yes:181        Yes:201                  
                                                                  
                                                                  
                                                                  
                                                                  
 AttendedNurserySchool TargetsHigherEducation InternetAccess RelationshipStatus
 No : 81               No : 20                No : 66        No :263           
 Yes:314               Yes:375                Yes:329        Yes:132           
                                                                               
                                                                               
                                                                               
                                                                               
 FamilyRelationship LeisureTime        SocialInteractionIntensity
 Length:395         Length:395         Length:395                
 Class :character   Class :character   Class :character          
 Mode  :character   Mode  :character   Mode  :character          
                                                                 
                                                                 
                                                                 
 AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus      
 Length:395                Length:395                Length:395        
 Class :character          Class :character          Class :character  
 Mode  :character          Mode  :character          Mode  :character  
                                                                       
                                                                       
                                                                       
 SchoolAbsences   FirstPeriodGrade SecondPeriodGrade   FinalGrade   
 Min.   : 0.000   Min.   : 3.00    Min.   : 0.00     Min.   : 0.00  
 1st Qu.: 0.000   1st Qu.: 8.00    1st Qu.: 9.00     1st Qu.: 8.00  
 Median : 4.000   Median :11.00    Median :11.00     Median :11.00  
 Mean   : 5.709   Mean   :10.91    Mean   :10.71     Mean   :10.42  
 3rd Qu.: 8.000   3rd Qu.:13.00    3rd Qu.:13.00     3rd Qu.:14.00  
 Max.   :75.000   Max.   :19.00    Max.   :19.00     Max.   :20.00  
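
Besides summary, some of the checks from the list above can be scripted with base R. The sketch below runs on a made-up miniature data frame (the name students and its values are assumptions), not on the real data set.

```r
# Made-up miniature of a cleaned data frame.
students <- data.frame(
    Id  = c(1, 2, 3, 4),
    Sex = factor(c('Female', 'Male', 'Female', 'Female')),
    Age = c(18, 17, 15, 16)
)

# Type of every column at a glance.
column_types <- sapply(students, class)

# Frequency table of a categorical variable; missing values are listed if present.
sex_counts <- table(students$Sex, useNA = 'ifany')

# anyDuplicated returns 0 when no value occurs twice.
id_is_unique <- anyDuplicated(students$Id) == 0
```

In the summary output above, the columns reported as "Class :character" are the ones that were recoded from numbers to text, which is exactly the kind of type question these checks are meant to surface.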

Conclusion on Cleaning & Tidying Data

  • Columns are variables and rows are observations
  • Each data frame captures one concept
  • Be consistent in naming and use style guides
  • Inspect your data frame via summary, but also look at the actual cleaned values
  • Encapsulate your tidying logic in its own script to avoid change propagation

Working with Data Frames using dplyr

Typical tasks are:

  • [Reshaping] How can we reshape the entire data frame?
  • [Windowing] How can we compute new columns?
  • [Summarise] How can we compute summary statistics?
  • [Selecting] How can we select specific columns?
  • [Filtering] How can we select specific rows?
  • [Grouping] How can we group parts of the data frame?
  • [Joining] How can we join tables?
  • [Ordering] How can we order data frames?
  • [Distinct] How can we retrieve unique values?
  • [Checking] How can we check results?
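
As a quick orientation before the longer examples, here is a minimal sketch of four of these tasks on a made-up data frame (students and its values are assumptions). It uses the dplyr verbs select, filter, arrange and distinct.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up data for illustration.
students <- data.frame(
    Id     = 1:5,
    School = c('GP', 'GP', 'MS', 'MS', 'GP'),
    Age    = c(18, 17, 15, 16, 17),
    stringsAsFactors = FALSE
)

# [Selecting, Filtering, Ordering]
older_students <- students %>%
    select(Id, Age) %>%        # keep two columns
    filter(Age >= 17) %>%      # keep rows that satisfy the condition
    arrange(desc(Age), Id)     # sort by age descending, then by Id

# [Distinct]
schools <- students %>%
    distinct(School)           # one row per unique school
```

Each verb takes a data frame and returns a data frame, which is why they chain naturally with the pipe.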

Example - Adding Identifiers

Identifiers for observations are important. They help during joins and make it possible to keep track of individual observations after reshaping.

How can we add an identifier for each observation?


In [176]:
# [Windowing, Selecting]
df <- df %>%
    mutate(Id = row_number()) %>%
    select(Id, everything())
df %>% head


Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences FirstPeriodGrade SecondPeriodGrade FinalGrade
1 GabrielPereira Female 18 Urban Large Apart Higher Higher Home Good Medium High VeryLow VeryLow Medium 6 5 6 6
2 GabrielPereira Female 17 Urban Large Together Primary Primary Home VeryGood Medium Medium VeryLow VeryLow Medium 4 5 5 6
3 GabrielPereira Female 15 Urban Small Together Primary Primary Home Good Medium Low Low Medium Medium 10 7 8 10
4 GabrielPereira Female 15 Urban Large Together Higher PrimaryExtended Health Ok Low Low VeryLow VeryLow VeryHigh 2 15 14 15
5 GabrielPereira Female 16 Urban Large Together SecondaryExtended SecondaryExtended Other Good Medium Low VeryLow Low VeryHigh 4 6 10 10
6 GabrielPereira Male 16 Urban Small Together Higher SecondaryExtended Services VeryGood High Low VeryLow Low VeryHigh 10 15 15 15

Example - Reshape Data Frame

After cleaning the data, we noticed that G1-G3 are just different types of grade, i.e., values that should be contained within a column rather than used as column names. We therefore consolidate the three columns into one categorical column describing the type of grade (First, Second, Final) and one column that contains the mark itself.


In [177]:
# [Reshaping]
# use tidyr to collect multiple columns into two columns
df <- df %>% 
    gather(key = GradeName, 
           value = Grade, 
           FirstPeriodGrade, SecondPeriodGrade, FinalGrade)
df %>% head


Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother RelationshipStatus FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences GradeName Grade
1 GabrielPereira Female 18 Urban Large Apart Higher Higher Home No Good Medium High VeryLow VeryLow Medium 6 FirstPeriodGrade 5
2 GabrielPereira Female 17 Urban Large Together Primary Primary Home No VeryGood Medium Medium VeryLow VeryLow Medium 4 FirstPeriodGrade 5
3 GabrielPereira Female 15 Urban Small Together Primary Primary Home No Good Medium Low Low Medium Medium 10 FirstPeriodGrade 7
4 GabrielPereira Female 15 Urban Large Together Higher PrimaryExtended Health Yes Ok Low Low VeryLow VeryLow VeryHigh 2 FirstPeriodGrade 15
5 GabrielPereira Female 16 Urban Large Together SecondaryExtended SecondaryExtended Other No Good Medium Low VeryLow Low VeryHigh 4 FirstPeriodGrade 6
6 GabrielPereira Male 16 Urban Small Together Higher SecondaryExtended Services No VeryGood High Low VeryLow Low VeryHigh 10 FirstPeriodGrade 15
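
The gather function still works, but since tidyr 1.0.0 its documentation describes it as superseded by pivot_longer. A minimal sketch of the same reshaping with pivot_longer follows, on a made-up two-student data frame (grades is an assumed name; the marks mirror students 1 and 4 above).

```r
library(tidyr)

# Two students in wide form, one column per grade type.
grades <- data.frame(Id = c(1, 4),
                     FirstPeriodGrade  = c(5, 15),
                     SecondPeriodGrade = c(6, 14),
                     FinalGrade        = c(6, 15))

# Collect the three grade columns into a name column and a value column.
long <- pivot_longer(grades,
                     cols = c(FirstPeriodGrade, SecondPeriodGrade, FinalGrade),
                     names_to = 'GradeName',
                     values_to = 'Grade')
long
```

One difference to keep in mind: pivot_longer keeps the rows of each student together, whereas gather stacks the data column by column, so the row order of the two results differs.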

Example - Convert Column

What are the grades in the Austrian mark system?

One way to handle this is to organize the facts into named vectors.


In [178]:
# portuguese marks
PORTUGUESE_MARKS <- c(worst = 0, 1:19, best = 20)
PORTUGUESE_MARKS


name/position  worst  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 best
value              0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19   20

In [179]:
# austrian marks
AUSTRIAN_MARKS <- c(best = 1, 2:4, worst = 5)
AUSTRIAN_MARKS


name/position  best  2  3  4 worst
value             1  2  3  4     5

To solve the problem we need to rescale and invert the Portuguese grades such that they map onto the range 1 to 5.

We can use feature scaling to map values from one scale to another, given by

$$ FeatureScaling(mark) = newMin + \dfrac{(mark - oldMin) \cdot (newMax - newMin)}{(oldMax - oldMin)}, $$

where oldMin and oldMax are the minimum and maximum of the Portuguese scale and newMin and newMax are the minimum and maximum of the Austrian scale.

The scale is then inverted by $$ InvertScale(mark) = newMax + 1 - mark, $$ which works here because the Austrian scale starts at 1.

The biggest advantage is that we can vectorize these computations over either the entire data frame or subsets of it.
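
As a worked example of the two steps, take the Portuguese mark 7, with the Portuguese scale running from 0 to 20 and the Austrian scale from 1 to 5. Rescaling gives

$$ FeatureScaling(7) = 1 + \dfrac{(7 - 0) \cdot (5 - 1)}{(20 - 0)} = 1 + 1.4 = 2.4, $$

and inverting gives

$$ InvertScale(2.4) = 5 + 1 - 2.4 = 3.6. $$

The boundaries behave as expected under the same two steps: the Portuguese 0 maps to the Austrian 5 and the Portuguese 20 maps to the Austrian 1.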


In [180]:
FeatureScaling <- function(x, oldMax, oldMin, newMax, newMin){
  newMin + ((x - oldMin) * (newMax - newMin) / (oldMax - oldMin))  
} 

InvertScale <- function(x, max){
    max + 1 - x
}

In [181]:
# [Windowing]
gradeAustrian_df <- df %>%
    mutate(GradeAustrian = FeatureScaling(Grade, 
                                          oldMax = max(PORTUGUESE_MARKS), 
                                          oldMin = min(PORTUGUESE_MARKS),
                                          newMax = max(AUSTRIAN_MARKS), 
                                          newMin = min(AUSTRIAN_MARKS)),
           GradeAustrian = InvertScale(GradeAustrian, 
                                       max = max(AUSTRIAN_MARKS)))
gradeAustrian_df %>%
    head


Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences GradeName Grade GradeAustrian
1 GabrielPereira Female 18 Urban Large Apart Higher Higher Home Good Medium High VeryLow VeryLow Medium 6 FirstPeriodGrade 5 4.0
2 GabrielPereira Female 17 Urban Large Together Primary Primary Home VeryGood Medium Medium VeryLow VeryLow Medium 4 FirstPeriodGrade 5 4.0
3 GabrielPereira Female 15 Urban Small Together Primary Primary Home Good Medium Low Low Medium Medium 10 FirstPeriodGrade 7 3.6
4 GabrielPereira Female 15 Urban Large Together Higher PrimaryExtended Health Ok Low Low VeryLow VeryLow VeryHigh 2 FirstPeriodGrade 15 2.0
5 GabrielPereira Female 16 Urban Large Together SecondaryExtended SecondaryExtended Other Good Medium Low VeryLow Low VeryHigh 4 FirstPeriodGrade 6 3.8
6 GabrielPereira Male 16 Urban Small Together Higher SecondaryExtended Services VeryGood High Low VeryLow Low VeryHigh 10 FirstPeriodGrade 15 2.0

Example - Manual Checking

Next we manually check whether the computation was successful via two basic questions:

  • Are the boundaries correctly computed?
  • Do the minimum, midpoint and maximum marks convert to the expected values?

In [182]:
# [Selecting, Filtering, Distinct, Checking]
gradeAustrian_df %>%
    select(Id, Grade, GradeAustrian) %>%
    filter(Grade == 0 | Grade == 10 | Grade == 20) %>%
    distinct(Grade, .keep_all=TRUE)


Id Grade GradeAustrian
11 10 3
131 0 5
48 20 1

One way to make automatic lightweight checks in your scripts is via assertions.


In [183]:
# [Checking]
assert_that(
    gradeAustrian_df %>%
        # a mark is invalid if it lies above 5 OR below 1; with & no row could ever match
        filter(GradeAustrian > 5 | GradeAustrian < 1) %>%
        nrow() 
    == 0
)


TRUE
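
When an assertion fails it helps if the error says what went wrong. The sketch below, on made-up marks with one deliberately invalid value, uses the msg argument of assert_that and the non-failing variant see_if from the assertthat package; the message text is an assumption chosen for this example.

```r
library(assertthat)

# Made-up converted marks; the last one is deliberately out of range.
marks <- c(1, 3.6, 5, 5.2)

# see_if evaluates the same condition but returns a logical instead of raising an error.
in_range <- see_if(all(marks >= 1 & marks <= 5))

# assert_that raises an error; msg sets the text of that error.
message_text <- tryCatch(
    assert_that(all(marks >= 1 & marks <= 5),
                msg = 'Austrian mark outside 1 to 5'),
    error = function(e) conditionMessage(e)
)
```

In a script the tryCatch wrapper would normally be left out, so that a failed assertion stops the run with that message.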

Example - Summarise Data

What is the average grade of a student?

The data frame now contains three rows per student, one for each of the three grades we want to summarise. Since the mean should be computed only over the three rows that belong to one student, we group by Id first.


In [184]:
# [Grouping, Summarise]
gradeMean_df <- df %>% 
    group_by(Id) %>%
    summarise(GradeMean = mean(Grade))

gradeMean_df %>%
    head


IdGradeMean
1 5.666667
2 5.333333
3 8.333333
4 14.666667
5 8.666667
6 15.000000
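
summarise can compute several statistics per group in one call. The sketch below uses a made-up long data frame (grades is an assumed name) whose marks mirror students 1 and 4 above, and adds minimum, maximum and row count via the dplyr helper n.

```r
library(dplyr, warn.conflicts = FALSE)

# Long-form grades of two students, three rows each.
grades <- data.frame(
    Id        = c(1, 1, 1, 4, 4, 4),
    GradeName = rep(c('FirstPeriodGrade', 'SecondPeriodGrade', 'FinalGrade'), 2),
    Grade     = c(5, 6, 6, 15, 14, 15),
    stringsAsFactors = FALSE
)

# One output row per student with several summary columns.
grade_summary <- grades %>%
    group_by(Id) %>%
    summarise(GradeMean  = mean(Grade),
              GradeMin   = min(Grade),
              GradeMax   = max(Grade),
              GradeCount = n())
grade_summary
```

The count column doubles as a check: every student should have exactly three grade rows after the reshaping step.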

Example - Joining

How can we add the mean grade to the existing data frame?


In [185]:
# [Joining]
# by default dplyr joins on all columns with matching names
# df %>%
#     inner_join(gradeMean_df)

# or, if it is only one column, simply name the column
# df %>%
#     inner_join(gradeMean_df, by = 'Id')

# but it is best to define the mapping explicitly to avoid mistakes
df %>%
    inner_join(gradeMean_df, by = c('Id' = 'Id')) %>%
    head


Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences GradeName Grade GradeMean
1 GabrielPereira Female 18 Urban Large Apart Higher Higher Home Good Medium High VeryLow VeryLow Medium 6 FirstPeriodGrade 5 5.666667
2 GabrielPereira Female 17 Urban Large Together Primary Primary Home VeryGood Medium Medium VeryLow VeryLow Medium 4 FirstPeriodGrade 5 5.333333
3 GabrielPereira Female 15 Urban Small Together Primary Primary Home Good Medium Low Low Medium Medium 10 FirstPeriodGrade 7 8.333333
4 GabrielPereira Female 15 Urban Large Together Higher PrimaryExtended Health Ok Low Low VeryLow VeryLow VeryHigh 2 FirstPeriodGrade 15 14.666667
5 GabrielPereira Female 16 Urban Large Together SecondaryExtended SecondaryExtended Other Good Medium Low VeryLow Low VeryHigh 4 FirstPeriodGrade 6 8.666667
6 GabrielPereira Male 16 Urban Small Together Higher SecondaryExtended Services VeryGood High Low VeryLow Low VeryHigh 10 FirstPeriodGrade 15 15.000000
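
In the example above every Id exists in both tables, so the kind of join does not matter. When keys are missing on one side it does, as this sketch with made-up data shows (students and means are assumed names): inner_join drops unmatched rows, while left_join keeps them and fills in NA.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up tables: student 3 has no entry in the means table.
students <- data.frame(Id = c(1, 2, 3), Age = c(18, 17, 15))
means    <- data.frame(Id = c(1, 2),    GradeMean = c(5.67, 5.33))

inner <- students %>% inner_join(means, by = 'Id')   # only Ids present in both tables
left  <- students %>% left_join(means, by = 'Id')    # all students, NA where no mean exists
```

Comparing the row counts before and after a join is a cheap way to notice silently dropped observations.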

References

Data

Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance.

Data Wrangling

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:10.18637/jss.v059.i10