Data Cleaning and Wrangling

Loading, cleaning, tidying and reshaping data with R.

Note: The tutorial series works a lot with tidyr, dplyr, ggplot2 and similar packages. These libraries provide a concise and clear way to reshape and visualize data in R beyond the base R approach.

If you need information about a function or package, simply type ?function_name.

tidyr: helps tidy data frames

dplyr: is a grammar of data manipulation

ggplot2: is a grammar of graphics for creating plots declaratively

In R, you load a package with the library function.



In [167]:

    
library(assertthat)
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

Before using a package it is always useful to look at its vignette or documentation. In general, if you want to know something about a function or package, use a question mark.



In [168]:

    
?dplyr

Or use a double question mark to search the documentation system for a specific term.



In [169]:

    
??dplyr

Import Data

R allows you to import data from CSV and JSON, but also from many other sources such as HDF5 files or databases. Start by loading your data into a minimal set of logically related data frames and transform them into dplyr tables (tibbles).



In [170]:

    
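# The file uses ';' as separator. stringsAsFactors = TRUE was the default before R 4.0
# and is set explicitly so that the factor columns shown in the outputs are reproduced.
df <- read.csv('datasets/student-mat.csv', sep = ';', stringsAsFactors = TRUE) %>%
    as_tibble()  # tbl_df() is deprecated in dplyr 1.0.0 and later
df %>% head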









    





school sex age address famsize Pstatus Medu Fedu Mjob Fjob ⋯ famrel freetime goout Dalc Walc health absences G1 G2 G3

	GP      F       18      U       GT3     A       4       4       at_home teacher ⋯       4       3       4       1       1       3        6       5       6       6      
	GP      F       17      U       GT3     T       1       1       at_home other   ⋯       5       3       3       1       1       3        4       5       5       6      
	GP      F       15      U       LE3     T       1       1       at_home other   ⋯       4       3       2       2       3       3       10       7       8      10      
	GP      F       15      U       GT3     T       4       2       health  services ⋯       3       2       2       1       1       5        2      15      14      15      
	GP      F       16      U       GT3     T       3       3       other   other   ⋯       4       3       2       1       2       5        4       6      10      10      
	GP      M       16      U       LE3     T       4       3       services other   ⋯       5       4       2       1       2       5       10      15      15      15

Conclusion on Importing

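R can access various data sources; for a specific source (for example a database or HDF5), search CRAN for a package that reads it.
Load the data in a minimal set of data frames.
Before R 4.0, read.csv converted strings to factors by default (stringsAsFactors = TRUE), so check the column types after loading.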
Encapsulate your loading logic in its own script to avoid change propagation.
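
The last point can be sketched as follows: a minimal, hypothetical loader function (LoadStudents is an illustrative name, not part of the original tutorial) that keeps every import decision in one place. The demo writes a tiny temporary file, assumed to mimic the ';'-separated layout, so that it runs without the real data set.

```r
# Hypothetical helper: all loading decisions (separator, string handling) live here.
LoadStudents <- function(path) {
  read.csv(path, sep = ";", stringsAsFactors = FALSE)
}

# Self-contained demo on a made-up file in the same ';'-separated layout.
tmp <- tempfile(fileext = ".csv")
writeLines(c("school;sex;age", "GP;F;18", "MS;M;17"), tmp)

students <- LoadStudents(tmp)
str(students)
```

Setting stringsAsFactors explicitly makes the result independent of the default of the installed R version.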

Tidy Data

Tidy data describes a set of principles for organizing data so that the follow-up analysis is simplified. It is centered on data frames in which columns describe variables and rows describe observations. A variable (a column) could be the name of a person, and every entry within the column is a specific name, e.g., John Doe.

This is similar to databases, where tidy data would be a table in Codd's third normal form, but reframed in statistical terms:

Each variable forms a column,
each observation forms a row,
each type of observational unit forms a table.

The five most common problems with messy data sets are:

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observation unit is stored in multiple tables.

Wickham (2014, see References) provides an example for each of these issues, so browse through the paper.
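
As a small illustration of the second problem (multiple variables stored in one column), here is a sketch using tidyr's separate function. The toy data frame and its SexAge column are made up for this example and do not appear in the student data set.

```r
library(tidyr)

# Made-up messy data: sex and age are packed into one column.
messy <- data.frame(Id = 1:2,
                    SexAge = c("Female_18", "Male_16"),
                    stringsAsFactors = FALSE)

# Split the column at the underscore; convert = TRUE turns Age into an integer.
tidy <- separate(messy, SexAge, into = c("Sex", "Age"), sep = "_", convert = TRUE)
tidy
```

After the split, each of the two variables has its own column, which is what the first tidy data principle asks for.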

Conventions

In addition to these tidy principles, it is highly recommended to apply common software development naming practices, both to variable names and to their values.

Data Cleaning

After loading, it is advisable to clean the data, i.e., define proper column names and recode the values into something sensible. Take for example the student data we loaded before:



In [171]:

    
df %>% head









    





school sex age address famsize Pstatus Medu Fedu Mjob Fjob ⋯ famrel freetime goout Dalc Walc health absences G1 G2 G3

	GP      F       18      U       GT3     A       4       4       at_home teacher ⋯       4       3       4       1       1       3        6       5       6       6      
	GP      F       17      U       GT3     T       1       1       at_home other   ⋯       5       3       3       1       1       3        4       5       5       6      
	GP      F       15      U       LE3     T       1       1       at_home other   ⋯       4       3       2       2       3       3       10       7       8      10      
	GP      F       15      U       GT3     T       4       2       health  services ⋯       3       2       2       1       1       5        2      15      14      15      
	GP      F       16      U       GT3     T       3       3       other   other   ⋯       4       3       2       1       2       5        4       6      10      10      
	GP      M       16      U       LE3     T       4       3       services other   ⋯       5       4       2       1       2       5       10      15      15      15

Immediate questions arise, for example:

What do Medu and Fedu mean?
What are the G1 to G3 columns?
etc.

Rename Variable

So let's first give the columns long, descriptive names. We can do this via dplyr's rename function.



In [172]:

    
df <- df %>%
    rename(Sex = sex,
           Age = age,
           School = school,
           HomeArea = address,
           ParentStatus = Pstatus,
           EducationMother = Medu,
           JobMother = Mjob,
           EducationFather = Fedu,
           JobFather = Fjob,
           Guardian = guardian,
           FamilySize = famsize,
           FamilyRelationship = famrel,
           SchoolChoiceReason = reason,
           TravelTime = traveltime,
           StudyTime = studytime,
           ClassFailed = failures,
           EducationalSchoolSupport = schoolsup,
           EducationalFamilySupport = famsup,
           ExtraCurricularActivities = activities,
           ExtraPaidClass = paid,
           InternetAccess = internet,
           AttendedNurserySchool = nursery,
           TargetsHigherEducation = higher,
           RelationshipStatus = romantic,
           LeisureTime = freetime,
           SocialInteractionIntensity = goout,
           AlcoholConsumptionWeekend = Walc,
           AlcoholConsumptionWorkday = Dalc,
           HealthStatus = health,
           SchoolAbsences = absences,
           FirstPeriodGrade = G1,
           SecondPeriodGrade = G2,
           FinalGrade = G3) 

df %>% head









    





School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother JobFather ⋯ FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences FirstPeriodGrade SecondPeriodGrade FinalGrade

	GP      F       18      U       GT3     A       4       4       at_home teacher ⋯       4       3       4       1       1       3        6       5       6       6      
	GP      F       17      U       GT3     T       1       1       at_home other   ⋯       5       3       3       1       1       3        4       5       5       6      
	GP      F       15      U       LE3     T       1       1       at_home other   ⋯       4       3       2       2       3       3       10       7       8      10      
	GP      F       15      U       GT3     T       4       2       health  services ⋯       3       2       2       1       1       5        2      15      14      15      
	GP      F       16      U       GT3     T       3       3       other   other   ⋯       4       3       2       1       2       5        4       6      10      10      
	GP      M       16      U       LE3     T       4       3       services other   ⋯       5       4       2       1       2       5       10      15      15      15

Now that we actually know what the columns mean, we can start to think about their values.

For instance:

What does GP stand for? Is it a school type?
What are the numbers in education? Years?
etc.

Recode Values

So let's recode the levels of the categorical data into values that make sense.



In [173]:

    
RecodeEducation <- function(x) recode(x, `0` = 'None', `1` = 'Primary', `2` = 'PrimaryExtended', `3` = 'SecondaryExtended', `4` = 'Higher') 
RecodeJob <- function(x) recode(x, teacher = 'Education', services = 'Services', at_home = 'Home', other = 'Other', health = 'Health')
RecodeBinary <- function(x) recode(x, yes = 'Yes', no = 'No')
RecodeLikert <- function(x) recode(x, `1` = 'VeryLow', `2` = 'Low', `3` = 'Medium', `4` = 'High', `5` = 'VeryHigh')
    
df <- df %>%
    mutate(Sex = recode(Sex, F = 'Female', M = 'Male'),
           School = recode(School, GP = 'GabrielPereira', MS = 'MousinhoDaSilveira'),
           HomeArea = recode(HomeArea, U = 'Urban', R = 'Rural'),
           ParentStatus = recode(ParentStatus, T = 'Together', A = 'Apart'),
           # education and job columns are recoded once, via mutate_at below
           Guardian = recode(Guardian, mother = 'Mother', father = 'Father', other = 'Other'),
           FamilySize = recode(FamilySize, GT3 = 'Large', LE3 = 'Small'),
           FamilyRelationship = recode(FamilyRelationship, `1` = 'VeryBad', `2` = 'Bad', `3` = 'Ok', `4` = 'Good', `5` = 'VeryGood'),
           SchoolChoiceReason = recode(SchoolChoiceReason, course = 'CoursePreference', other = 'Other', home = 'HomeProximity', reputation = 'Reputation'),
           TravelTime = recode(TravelTime, `1` = 'x < 15', `2` = '15 <= x < 30', `3` = '30 <= x < 60', `4` = 'x >= 60'),
           StudyTime = recode(StudyTime, `1` = 'x < 120', `2` = '120 <= x < 300', `3` = '300 <= x < 600', `4` = 'x >= 600')) %>%
    mutate_at(vars(EducationMother, EducationFather), .funs = RecodeEducation) %>%
    mutate_at(vars(JobMother, JobFather), .funs = RecodeJob) %>%
    mutate_at(vars(EducationalFamilySupport, 
                   EducationalSchoolSupport, 
                   ExtraCurricularActivities, 
                   ExtraPaidClass, 
                   InternetAccess, 
                   AttendedNurserySchool, 
                   TargetsHigherEducation, 
                   RelationshipStatus),
              .funs = RecodeBinary) %>%
    mutate_at(vars(LeisureTime,
                   SocialInteractionIntensity,
                   AlcoholConsumptionWeekend,
                   AlcoholConsumptionWorkday,
                   HealthStatus),
              .funs = RecodeLikert)



In [174]:

    
df %>% head









    





School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother JobFather ⋯ FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences FirstPeriodGrade SecondPeriodGrade FinalGrade

	GabrielPereira   Female           18               Urban            Large            Apart            Higher           Higher           Home             Education        ⋯                Good             Medium           High             VeryLow          VeryLow          Medium            6                5                6                6               
	GabrielPereira   Female           17               Urban            Large            Together         Primary          Primary          Home             Other            ⋯                VeryGood         Medium           Medium           VeryLow          VeryLow          Medium            4                5                5                6               
	GabrielPereira   Female           15               Urban            Small            Together         Primary          Primary          Home             Other            ⋯                Good             Medium           Low              Low              Medium           Medium           10                7                8               10               
	GabrielPereira   Female           15               Urban            Large            Together         Higher           PrimaryExtended  Health           Services         ⋯                Ok               Low              Low              VeryLow          VeryLow          VeryHigh          2               15               14               15               
	GabrielPereira   Female           16               Urban            Large            Together         SecondaryExtended SecondaryExtended Other            Other            ⋯                Good             Medium           Low              VeryLow          Low              VeryHigh          4                6               10               10               
	GabrielPereira   Male             16               Urban            Small            Together         Higher           SecondaryExtended Services         Other            ⋯                VeryGood         High             Low              VeryLow          Low              VeryHigh         10               15               15               15

This is much more readable and understandable, and we can make quick sanity checks. For instance, we now know that education is categorical data, but on a different scale than family relationship, and so forth.
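
In dplyr 1.0.0 and later, mutate_at is marked as superseded in favour of across. The following is a minimal sketch of the same recoding pattern on made-up data; the toy tibble is hypothetical, and RecodeBinary is repeated so that the block runs on its own.

```r
library(dplyr, warn.conflicts = FALSE)

RecodeBinary <- function(x) recode(x, yes = "Yes", no = "No")

# Made-up data with two yes/no columns
toy <- tibble(InternetAccess = c("yes", "no"),
              ExtraPaidClass = c("no", "no"))

# Apply the same recoding function to several columns at once
toy <- toy %>%
  mutate(across(c(InternetAccess, ExtraPaidClass), RecodeBinary))
toy
```

Either form keeps the mapping in one function, so a change to the labels only has to be made once.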

Inspect data

Next we check the basic characteristics of the data frame:

Are the variable names correct?
Is the type of each variable correct (for example factor versus character)?
Is the basic distribution of your categorical data as expected, or did you omit, reverse or lose values?
Is the basic distribution of your interval data as expected, or are the maxima, means or quantiles off?
Are all variables really variables, or are they actually values?
Do you have a unique identifier for each observation?



In [175]:

    
summary(df)









    





                School        Sex           Age        HomeArea   FamilySize 
 GabrielPereira    :349   Female:208   Min.   :15.0   Rural: 88   Large:281  
 MousinhoDaSilveira: 46   Male  :187   1st Qu.:16.0   Urban:307   Small:114  
                                       Median :17.0                          
                                       Mean   :16.7                          
                                       3rd Qu.:18.0                          
                                       Max.   :22.0                          
   ParentStatus EducationMother    EducationFather        JobMother  
 Apart   : 41   Length:395         Length:395         Home     : 59  
 Together:354   Class :character   Class :character   Health   : 34  
                Mode  :character   Mode  :character   Other    :141  
                                                      Services :103  
                                                      Education: 58  
                                                                     
     JobFather          SchoolChoiceReason   Guardian    TravelTime       
 Home     : 20   CoursePreference:145      Father: 90   Length:395        
 Health   : 18   HomeProximity   :109      Mother:273   Class :character  
 Other    :217   Other           : 36      Other : 32   Mode  :character  
 Services :111   Reputation      :105                                     
 Education: 29                                                            
                                                                          
  StudyTime          ClassFailed     EducationalSchoolSupport
 Length:395         Min.   :0.0000   No :344                 
 Class :character   1st Qu.:0.0000   Yes: 51                 
 Mode  :character   Median :0.0000                           
                    Mean   :0.3342                           
                    3rd Qu.:0.0000                           
                    Max.   :3.0000                           
 EducationalFamilySupport ExtraPaidClass ExtraCurricularActivities
 No :153                  No :214        No :194                  
 Yes:242                  Yes:181        Yes:201                  
                                                                  
                                                                  
                                                                  
                                                                  
 AttendedNurserySchool TargetsHigherEducation InternetAccess RelationshipStatus
 No : 81               No : 20                No : 66        No :263           
 Yes:314               Yes:375                Yes:329        Yes:132           
                                                                               
                                                                               
                                                                               
                                                                               
 FamilyRelationship LeisureTime        SocialInteractionIntensity
 Length:395         Length:395         Length:395                
 Class :character   Class :character   Class :character          
 Mode  :character   Mode  :character   Mode  :character          
                                                                 
                                                                 
                                                                 
 AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus      
 Length:395                Length:395                Length:395        
 Class :character          Class :character          Class :character  
 Mode  :character          Mode  :character          Mode  :character  
                                                                       
                                                                       
                                                                       
 SchoolAbsences   FirstPeriodGrade SecondPeriodGrade   FinalGrade   
 Min.   : 0.000   Min.   : 3.00    Min.   : 0.00     Min.   : 0.00  
 1st Qu.: 0.000   1st Qu.: 8.00    1st Qu.: 9.00     1st Qu.: 8.00  
 Median : 4.000   Median :11.00    Median :11.00     Median :11.00  
 Mean   : 5.709   Mean   :10.91    Mean   :10.71     Mean   :10.42  
 3rd Qu.: 8.000   3rd Qu.:13.00    3rd Qu.:13.00     3rd Qu.:14.00  
 Max.   :75.000   Max.   :19.00    Max.   :19.00     Max.   :20.00
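
Some of the questions above can also be checked programmatically. The following is a sketch in base R on a made-up three-row data frame (toy, its Id column and its values are assumptions for illustration); the same calls can be applied to the columns of df.

```r
toy <- data.frame(Id  = c(1L, 2L, 3L),
                  Sex = factor(c("Female", "Male", "Female")),
                  Age = c(18L, 17L, 15L))

# Type of every column
column_types <- sapply(toy, class)

# Distribution of a categorical variable, including missing values if present
sex_counts <- table(toy$Sex, useNA = "ifany")

# Range of an interval variable
age_range <- range(toy$Age)

# Is the identifier unique? anyDuplicated() returns 0 when there are no duplicates.
id_is_unique <- anyDuplicated(toy$Id) == 0

column_types
sex_counts
age_range
id_is_unique
```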

Conclusion on Cleaning & Tidying Data

Columns are variables and rows are observations
Each data frame captures one concept
Be consistent in naming and use style guides
Inspect your data frame via summary, but also look at the actual cleaned values
Encapsulate your tidying logic in its own script to avoid change propagation

Working with Data Frames using dplyr

Typical tasks are:

[Reshaping] How can we reshape the entire data frame?
[Windowing] How can we compute new columns?
[Summarise] How can we compute summary statistics?
[Selecting] How can we select specific columns?
[Filtering] How can we select specific rows?
[Grouping] How can we group parts of the data frame?
[Joining] How can we join tables?
[Ordering] How can we order data frames?
[Distinct] How can we retrieve unique values?
[Checking] How can we check results?

Example - Adding Identifiers

Identifiers for observations are important. They help during joins and keep track of individual observations after reshaping.

How can we add an identifier for each observation?



In [176]:

    
# [Windowing, Selecting]
df <- df %>%
    mutate(Id = row_number()) %>%
    select(Id, everything())
df %>% head









    





Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother ⋯ FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences FirstPeriodGrade SecondPeriodGrade FinalGrade

	1                GabrielPereira   Female           18               Urban            Large            Apart            Higher           Higher           Home             ⋯                Good             Medium           High             VeryLow          VeryLow          Medium            6                5                6                6               
	2                GabrielPereira   Female           17               Urban            Large            Together         Primary          Primary          Home             ⋯                VeryGood         Medium           Medium           VeryLow          VeryLow          Medium            4                5                5                6               
	3                GabrielPereira   Female           15               Urban            Small            Together         Primary          Primary          Home             ⋯                Good             Medium           Low              Low              Medium           Medium           10                7                8               10               
	4                GabrielPereira   Female           15               Urban            Large            Together         Higher           PrimaryExtended  Health           ⋯                Ok               Low              Low              VeryLow          VeryLow          VeryHigh          2               15               14               15               
	5                GabrielPereira   Female           16               Urban            Large            Together         SecondaryExtended SecondaryExtended Other            ⋯                Good             Medium           Low              VeryLow          Low              VeryHigh          4                6               10               10               
	6                GabrielPereira   Male             16               Urban            Small            Together         Higher           SecondaryExtended Services         ⋯                VeryGood         High             Low              VeryLow          Low              VeryHigh         10               15               15               15

Example - Reshape Data Frame

After cleaning the data, we notice that G1 to G3 are just different types of grade, i.e., values that should be contained within a column. We therefore consolidate the three columns into one categorical column describing the type of grade (first, second, final) and one column that contains the mark itself.



In [177]:

    
# [Reshaping]
# use tidyr to collect multiple columns into two columns
df <- df %>% 
    gather(key = GradeName, 
           value = Grade, 
           FirstPeriodGrade, SecondPeriodGrade, FinalGrade)
df %>% head









    





Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother ⋯ RelationshipStatus FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences GradeName Grade

	1                GabrielPereira   Female           18               Urban            Large            Apart            Higher           Higher           Home             ⋯                No               Good             Medium           High             VeryLow          VeryLow          Medium            6               FirstPeriodGrade  5               
	2                GabrielPereira   Female           17               Urban            Large            Together         Primary          Primary          Home             ⋯                No               VeryGood         Medium           Medium           VeryLow          VeryLow          Medium            4               FirstPeriodGrade  5               
	3                GabrielPereira   Female           15               Urban            Small            Together         Primary          Primary          Home             ⋯                No               Good             Medium           Low              Low              Medium           Medium           10               FirstPeriodGrade  7               
	4                GabrielPereira   Female           15               Urban            Large            Together         Higher           PrimaryExtended  Health           ⋯                Yes              Ok               Low              Low              VeryLow          VeryLow          VeryHigh          2               FirstPeriodGrade 15               
	5                GabrielPereira   Female           16               Urban            Large            Together         SecondaryExtended SecondaryExtended Other            ⋯                No               Good             Medium           Low              VeryLow          Low              VeryHigh          4               FirstPeriodGrade  6               
	6                GabrielPereira   Male             16               Urban            Small            Together         Higher           SecondaryExtended Services         ⋯                No               VeryGood         High             Low              VeryLow          Low              VeryHigh         10               FirstPeriodGrade 15
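
Recent tidyr releases (1.0.0 and later) mark gather as superseded by pivot_longer. As a sketch of the equivalent call, on a made-up two-student data frame (toy is an assumption for illustration):

```r
library(tidyr)

# Made-up wide data: one column per grade type
toy <- data.frame(Id = 1:2,
                  FirstPeriodGrade  = c(5L, 7L),
                  SecondPeriodGrade = c(6L, 8L),
                  FinalGrade        = c(6L, 10L))

# Collect the three grade columns into a name column and a value column
long <- pivot_longer(toy,
                     cols = c(FirstPeriodGrade, SecondPeriodGrade, FinalGrade),
                     names_to = "GradeName",
                     values_to = "Grade")
long
```

Note that pivot_longer keeps the rows of each student together, whereas gather stacks the output column by column, so the row order differs even though the content is the same.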

Example - Convert Column

What are the grades in the Austrian mark system?

One way to handle this is to organize the facts into named vectors.



In [178]:

    
# portuguese marks
PORTUGUESE_MARKS <- c(worst = 0, 1:19, best = 20)
PORTUGUESE_MARKS



In [179]:

    
# austrian marks
AUSTRIAN_MARKS <- c(best = 1, 2:4, worst = 5)
AUSTRIAN_MARKS

To solve the problem we need to rescale and invert the Portuguese grades so that they map onto the range 1 to 5.

We can use feature scaling to map values from one scale to another:

$$ FeatureScaling(mark) = newMin + \dfrac{(mark - oldMin) \cdot (newMax - newMin)}{(oldMax - oldMin)}, $$

where oldMin and oldMax are the minimum and maximum of the Portuguese scale and newMin and newMax are the minimum and maximum of the Austrian scale.

The scale is then inverted by $$ InvertScale(mark) = newMax + 1 - mark, $$ which assumes that the new scale starts at 1.

The biggest advantage is that these computations are vectorized, so we can apply them to the entire data frame or to subsets of it.
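
As a worked instance, assuming the Portuguese scale runs from 0 to 20 and the Austrian scale from 1 to 5, a Portuguese mark of 15 is first rescaled and then inverted:

$$ 1 + \dfrac{(15 - 0) \cdot (5 - 1)}{(20 - 0)} = 1 + 3 = 4, \qquad 5 + 1 - 4 = 2. $$

So a Portuguese 15 corresponds to an Austrian 2 under this linear mapping.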



In [180]:

    
FeatureScaling <- function(x, oldMax, oldMin, newMax, newMin){
  newMin + ((x - oldMin) * (newMax - newMin) / (oldMax - oldMin))  
} 

InvertScale <- function(x, max){
    max + 1 - x
}
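
To illustrate the vectorization, the two helpers can be applied to a plain numeric vector. This sketch repeats the definitions so that it runs on its own; the three input marks are chosen arbitrarily as the minimum, midpoint and maximum of the Portuguese scale.

```r
PORTUGUESE_MARKS <- c(worst = 0, 1:19, best = 20)
AUSTRIAN_MARKS <- c(best = 1, 2:4, worst = 5)

FeatureScaling <- function(x, oldMax, oldMin, newMax, newMin) {
  newMin + ((x - oldMin) * (newMax - newMin) / (oldMax - oldMin))
}

InvertScale <- function(x, max) {
  max + 1 - x
}

# Three arbitrary Portuguese marks, converted in one vectorized call each
marks <- c(0, 10, 20)
scaled <- FeatureScaling(marks,
                         oldMax = max(PORTUGUESE_MARKS),
                         oldMin = min(PORTUGUESE_MARKS),
                         newMax = max(AUSTRIAN_MARKS),
                         newMin = min(AUSTRIAN_MARKS))
austrian <- InvertScale(scaled, max = max(AUSTRIAN_MARKS))
austrian
```

The worst Portuguese mark maps to the worst Austrian mark, the midpoint to the midpoint, and the best to the best.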



In [181]:

    
# [Windowing]
gradeAustrian_df <- df %>%
    mutate(GradeAustrian = FeatureScaling(Grade, 
                                          oldMax = max(PORTUGUESE_MARKS), 
                                          oldMin = min(PORTUGUESE_MARKS),
                                          newMax = max(AUSTRIAN_MARKS), 
                                          newMin = min(AUSTRIAN_MARKS)),
           GradeAustrian = InvertScale(GradeAustrian, 
                                       max = max(AUSTRIAN_MARKS)))
gradeAustrian_df %>%
    head









    





Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother ⋯ FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences GradeName Grade GradeAustrian

	1                GabrielPereira   Female           18               Urban            Large            Apart            Higher           Higher           Home             ⋯                Good             Medium           High             VeryLow          VeryLow          Medium            6               FirstPeriodGrade  5               4.0              
	2                GabrielPereira   Female           17               Urban            Large            Together         Primary          Primary          Home             ⋯                VeryGood         Medium           Medium           VeryLow          VeryLow          Medium            4               FirstPeriodGrade  5               4.0              
	3                GabrielPereira   Female           15               Urban            Small            Together         Primary          Primary          Home             ⋯                Good             Medium           Low              Low              Medium           Medium           10               FirstPeriodGrade  7               3.6              
	4                GabrielPereira   Female           15               Urban            Large            Together         Higher           PrimaryExtended  Health           ⋯                Ok               Low              Low              VeryLow          VeryLow          VeryHigh          2               FirstPeriodGrade 15               2.0              
	5                GabrielPereira   Female           16               Urban            Large            Together         SecondaryExtended SecondaryExtended Other            ⋯                Good             Medium           Low              VeryLow          Low              VeryHigh          4               FirstPeriodGrade  6               3.8              
	6                GabrielPereira   Male             16               Urban            Small            Together         Higher           SecondaryExtended Services         ⋯                VeryGood         High             Low              VeryLow          Low              VeryHigh         10               FirstPeriodGrade 15               2.0

Example - Manual Checking

Next we manually check whether the computation was successful via two basic questions:

Are the boundaries correctly computed?
Do the minimum, maximum and midpoint of the scale convert as expected?



In [182]:

    
# [Selecting, Filtering, Distinct, Checking]
gradeAustrian_df %>%
    select(Id, Grade, GradeAustrian) %>%
    filter(Grade == 0 | Grade == 10 | Grade == 20) %>%
    distinct(Grade, .keep_all=TRUE)









    





Id Grade GradeAustrian

	 11 10 3  
	131  0 5  
	 48 20 1

One way to make automatic lightweight checks in your scripts is via assertions.



In [183]:

    
# [Checking]
# No converted grade may fall outside the Austrian range 1 to 5.
assert_that(
    gradeAustrian_df %>%
        filter(GradeAustrian > 5 | GradeAustrian < 1) %>%
        nrow() 
    == 0
)









    




TRUE
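
To see how such an assertion behaves when it fails, here is a sketch on made-up vectors. IsValidAustrianGrade is a hypothetical helper introduced only for this example, and the error message text is an assumption as well.

```r
library(assertthat)

# Hypothetical helper: TRUE only if every grade lies within 1 to 5.
IsValidAustrianGrade <- function(x) all(x >= 1 & x <= 5)

valid <- c(1, 3.6, 5)
invalid <- c(0.5, 3, 6)

# Passes and returns TRUE
check_valid <- assert_that(IsValidAustrianGrade(valid))

# Fails; tryCatch captures the error so that the script keeps running
check_invalid <- tryCatch(
  assert_that(IsValidAustrianGrade(invalid),
              msg = "Austrian grades must lie between 1 and 5"),
  error = function(e) conditionMessage(e)
)
check_invalid
```

In a real script the failing assertion would usually not be caught, so that the run stops at the first violated expectation.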

Example - Summarise Data

What is the average grade of a student?

The data frame now contains three rows per student, since there are three different grades that we want to summarise. We want to apply the mean function only to the three rows associated with a specific student, so it is time for grouping.



In [184]:

    
# [Grouping, Summarise]
gradeMean_df <- df %>% 
    group_by(Id) %>%
    summarise(GradeMean = mean(Grade))

gradeMean_df %>%
    head









    





Id GradeMean

	1         5.666667
	2         5.333333
	3         8.333333
	4        14.666667
	5         8.666667
	6        15.000000
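
Before trusting the mean, it can help to confirm that every student really contributes three grades. A sketch on made-up data (the toy tibble and the GradeCount column are assumptions for illustration):

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up long data: two students with three grades each
toy <- tibble(Id = rep(1:2, each = 3),
              Grade = c(5, 6, 6, 5, 5, 6))

# n() counts the rows in each group alongside the mean
per_student <- toy %>%
  group_by(Id) %>%
  summarise(GradeCount = n(),
            GradeMean = mean(Grade))
per_student
```

A GradeCount other than 3 would indicate missing or duplicated rows for that student.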

Example - Joining

How can we add the mean grade to the existing data frame?



In [185]:

    
# [Joining]
# by default, dplyr joins on all columns with matching names
# df %>%
#     inner_join(gradeMean_df)

# or, if it is only one column, simply name the column
# df %>%
#     inner_join(gradeMean_df, by = 'Id')

# but it is best to define the mapping explicitly to avoid mistakes
df %>%
    inner_join(gradeMean_df, by = c('Id' = 'Id')) %>%
    head









    





Id School Sex Age HomeArea FamilySize ParentStatus EducationMother EducationFather JobMother ⋯ FamilyRelationship LeisureTime SocialInteractionIntensity AlcoholConsumptionWorkday AlcoholConsumptionWeekend HealthStatus SchoolAbsences GradeName Grade GradeMean

	1                GabrielPereira   Female           18               Urban            Large            Apart            Higher           Higher           Home             ⋯                Good             Medium           High             VeryLow          VeryLow          Medium            6               FirstPeriodGrade  5                5.666667        
	2                GabrielPereira   Female           17               Urban            Large            Together         Primary          Primary          Home             ⋯                VeryGood         Medium           Medium           VeryLow          VeryLow          Medium            4               FirstPeriodGrade  5                5.333333        
	3                GabrielPereira   Female           15               Urban            Small            Together         Primary          Primary          Home             ⋯                Good             Medium           Low              Low              Medium           Medium           10               FirstPeriodGrade  7                8.333333        
	4                GabrielPereira   Female           15               Urban            Large            Together         Higher           PrimaryExtended  Health           ⋯                Ok               Low              Low              VeryLow          VeryLow          VeryHigh          2               FirstPeriodGrade 15               14.666667        
	5                GabrielPereira   Female           16               Urban            Large            Together         SecondaryExtended SecondaryExtended Other            ⋯                Good             Medium           Low              VeryLow          Low              VeryHigh          4               FirstPeriodGrade  6                8.666667        
	6                GabrielPereira   Male             16               Urban            Small            Together         Higher           SecondaryExtended Services         ⋯                VeryGood         High             Low              VeryLow          Low              VeryHigh         10               FirstPeriodGrade 15               15.000000
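
An inner join silently drops rows without a match, so it is worth checking the row count afterwards. The following sketch uses made-up tables (grades and means are hypothetical) in which one Id has no mean:

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up tables: Id 3 has a grade but no entry in the means table
grades <- tibble(Id = c(1L, 1L, 2L, 3L), Grade = c(5, 6, 7, 15))
means <- tibble(Id = c(1L, 2L), GradeMean = c(5.5, 7))

# Inner join keeps only rows whose Id appears in both tables
joined <- grades %>%
  inner_join(means, by = c("Id" = "Id"))

# anti_join lists the rows of grades that found no partner
unmatched <- grades %>%
  anti_join(means, by = "Id")

nrow(grades)
nrow(joined)
unmatched
```

If dropping such rows is not intended, left_join keeps them and fills the missing columns with NA.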

References

Data

Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance.

Data Wrangling

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:10.18637/jss.v059.i10
