Beginning with Data Cleaning with Pandas

Introduction

The aim of this notebook is to introduce you to basic data cleaning using Python and Pandas. Most of the content follows the ideas presented in the great report by de Jonge and van der Loo - [Introduction to data cleaning with R][1].

As explained in [1], most of a Data Scientist's work is spent cleaning and preparing data before any statistical analysis or model application. It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data [2]. Of course one can find data sources that are ready to go, but these are usually already-explored data sets built for practical examples. The reality, however, is that data is full of errors, lacks format consistency and is potentially incomplete. The Data Scientist's mission is to convert these raw data sources into consistent data sets that can be used as input for further analysis.

Even technically correct data sets, after hard work on cleaning, checking and filling, can lack a standard way to organize data values within a dataset. Hadley Wickham defined this standardization as Tidy Data [3].

Statistical Analysis in Five Steps

Van der Loo defines statistical data analysis in five steps:

  1. Raw data:
    | type checking, normalizing         
    v                                    
  2. Technically correct data
    | fix and impute                     
    v                                    
  3. Consistent data
    | estimate, analyze, derive, etc.
    v
  4. Statistical results
    | tabulate, plot
    v
  5. Formatted output

In the previous graph you find a numbered list with five items. Each item represents data in a different state, and the arrows represent the actions needed to move from one state to the next. Note that as data is transformed from the first to the fifth state, it gains value incrementally at each step.

At the first stage, we have the data as is. Raw data is a rough diamond that is going to be cut and polished at each step. Among the errors we can find are: wrong types, inconsistent variable encodings, unlabeled data, etc.

Technically correct data is data that can be loaded into Pandas structures; let's say it has the proper "shape", with correct names, types, labels and so on. However, variables may still be out of range or potentially inconsistent (in their relations with other variables).
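For example, a numeric column read from a messy file may arrive as strings. Coercing it to a numeric type is a typical step from raw to technically correct data; a minimal sketch (the values below are made up for illustration):

import pandas as pd

# a raw column where one entry is not parseable as a number
raw = pd.Series(["42", "37", "twenty-nine", "61"])

# coerce to numeric; unparseable entries become NaN instead of raising an error
ages = pd.to_numeric(raw, errors="coerce")
print(ages.dtype)  # float64, with NaN where coercion failed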

In the consistent data stage, data is ready for statistical inference. For example, the total income of a year must be the sum of the twelve monthly incomes.
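The income example can be written as an explicit consistency check. A sketch with made-up figures:

import pandas as pd

monthly_income = pd.Series([1200.0] * 12)  # hypothetical monthly incomes
reported_yearly_total = 14400.0            # hypothetical reported yearly total

# consistency rule: the yearly total must equal the sum of the months
assert monthly_income.sum() == reported_yearly_total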

The last stages contain the statistical results derived from the analysis, which can ultimately be formatted into a concise, presentable layout.

Best Practice: It is a good idea to keep the input of each step in a local file, and to keep the methods applied at each stage ready to be reproduced.

Why? We will see that at each stage we can potentially lose or modify the initial data. This loss or modification can influence the final analysis, so all operations performed over a data set should be reproducible.

Python offers a good interactive environment that facilitates the transformation and computation of datasets, while providing a scripting framework that makes procedures easy to reproduce.
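A minimal sketch of this best practice with Pandas, writing a snapshot of the data after each stage so any step can be re-run in isolation (file names are placeholders):

import pandas as pd

df = pd.read_csv("raw_data.csv")      # stage 1: raw data (placeholder path)
# ... type checking and normalizing go here ...
df.to_csv("technically_correct.csv")  # snapshot of stage 2
# ... fixing and imputing go here ...
df.to_csv("consistent.csv")           # snapshot of stage 3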

A Kind Reminder on (Statistical) Variables

Data cleaning can be seen as the first step of statistical analysis, and as programmers we tend to forget or mix up statistical terms. What does a statistician mean by variable? For a computer programmer, a variable is a memory space that can be filled with a known (or unknown) piece of information (a.k.a. a value). Moreover, this space has an associated name that a running program can use to read or modify the variable's value. Don't take this as an exact definition, but it gives us (the computer scientists) a general refresher on what a variable is.

Well, statisticians have their own variables, so let's give another informal definition. In statistics, a variable is an attribute that describes a person, place, thing, or idea (often referred to as a feature).

As an example, we can take a list of physical characteristics of 10 persons. The objects (rows) of the matrix are the persons; the variables are the measured properties, such as weight or eye color.


In [2]:
import pandas as pd

# load the sample data; the first column holds the row labels
df = pd.read_csv('../data/people.csv', index_col=0)
df


Out[2]:
               Age[years]     Sex  Weight[kg] Eye Color  Body Temperature[C]
individuum 1           42  female        52.9     brown                 36.9
individuum 2           37    male        87.0     green                 36.3
individuum 3           29    male        82.1      blue                 36.4
individuum 4           61  female        62.5      blue                 36.7
individuum 5           77  female        55.5      gray                 36.6
individuum 6           33    male        95.2     green                 36.5
individuum 7           32  female        81.8     brown                 37.0
individuum 8           45    male        78.9     brown                 36.3
individuum 9           18    male        83.4     green                 36.6
individuum 10          19    male        84.7      gray                 36.1

Variables can be classified as qualitative (aka, categorical) or quantitative (aka, numeric).

  • Qualitative: Qualitative variables take on values that are names or labels. The eye color (e.g., brown, green, gray, etc.) or the sex of the person (e.g., female, male) would be examples of qualitative or categorical variables.

  • Quantitative: Quantitative variables are numeric. They represent a measurable quantity. For example, when we speak of the age of a person, we are talking about the time passed since their birth - a measurable attribute of the person. Therefore, age would be a quantitative variable. A programmatic way to tell the two kinds apart is sketched right after this list.
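With the people data frame loaded above, one way to separate the two kinds of variables is by dtype; a minimal sketch using select_dtypes:

# numeric columns hold the quantitative variables
quantitative = df.select_dtypes(include="number")
# object (string) columns hold the qualitative variables
qualitative = df.select_dtypes(include="object")

print(list(quantitative.columns))  # Age[years], Weight[kg], Body Temperature[C]
print(list(qualitative.columns))   # Sex, Eye Color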

Basically, Pandas has two fundamental data structures: Series and DataFrame. According to the previous definition of statistical variables, a Series would be a variable and a DataFrame would be a set of variables. Moreover, if we slice a single column out of a DataFrame (a set of variables), we get a Series (a variable).


In [3]:
print(type(df["Sex"]))


<class 'pandas.core.series.Series'>

And what kind of variable is, for example, the age of a person?


In [4]:
print(df["Age[years]"].dtype)


int64

And what about categorical variables? Well, that deserves more than a word; for now, our categorical variables will commonly have the object dtype.


In [5]:
print(df["Sex"].dtype)
print (df["Eye Color"].dtype)


object
object
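
As a side note, pandas also offers a dedicated category dtype that makes the categorical intent explicit (and can save memory). A minimal sketch, assuming the df loaded above:

# convert an object column to an explicit categorical variable
sex = df["Sex"].astype("category")
print(sex.dtype)           # category
print(sex.cat.categories)  # the distinct labels: 'female', 'male'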

Special Values

There are some values that are considered special. This is the case for missing data (when no data value is stored for a variable in an observation). There are other cases, such as Inf and -Inf, etc.
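In NumPy and Pandas these special values appear as np.nan, np.inf and -np.inf; floating-point division by zero is a quick way to see all three. A sketch:

import numpy as np
import pandas as pd

s = pd.Series([1.0, -1.0, 0.0])
print(s / 0)            # inf, -inf and NaN (0/0 is not a number)
print(np.isinf(s / 0))  # True, True, False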

Missing data

Let's suppose that we are collecting individuum attributes, and the weight of an individuum gets lost or can't be measured. We always try to make the most of the data we have, so the best way to deal with this kind of situation is to mark these values as missing.


In [6]:
import pandas as pd

# entries matching "NaN" in the file are parsed as missing values
df = pd.read_csv('../data/people_with_nas.csv', index_col=0, na_values="NaN")
print(df)
print(df["Weight[kg]"])
pd.isnull(df["Weight[kg]"])


               Age[years]      Sex  Weight[kg] Eye Color  Body Temperature[C]
individuum 1           42   female        52.9     brown                 36.9
individuum 2           37     male        87.0     green                 36.3
individuum 3           29     male        82.1      blue                 36.4
individuum 4           61   female        62.5      blue                 36.7
individuum 5           77   female        55.5      gray                 36.6
individuum 6           33     male        95.2     green                 36.5
individuum 7           32   female        81.8     brown                 37.0
individuum 8           45     male        78.9     brown                 36.3
individuum 9           18     male         NaN     green                 36.6
individuum 10          19     male        84.7      gray                 36.1
individuum 1     52.9
individuum 2     87.0
individuum 3     82.1
individuum 4     62.5
individuum 5     55.5
individuum 6     95.2
individuum 7     81.8
individuum 8     78.9
individuum 9      NaN
individuum 10    84.7
Name: Weight[kg], dtype: float64
Out[6]:
individuum 1     False
individuum 2     False
individuum 3     False
individuum 4     False
individuum 5     False
individuum 6     False
individuum 7     False
individuum 8     False
individuum 9      True
individuum 10    False
Name: Weight[kg], dtype: bool
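
Once missing values are marked as NaN, Pandas can count and handle them uniformly. A short sketch of common follow-ups (which strategy is appropriate depends on the analysis):

# count missing values per column
print(df.isnull().sum())

# option 1: drop observations that contain missing values
complete = df.dropna()

# option 2: impute with a statistic, e.g. the column mean
weight = df["Weight[kg]"].fillna(df["Weight[kg]"].mean())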

You should be asking yourself... why use np.nan and not None?

The answer is (or at least should be) that np.nan allows vectorized operations.


In [7]:
import numpy as np

v1 = pd.Series([1,2,3,4])
v2 = pd.Series([1,2,None,4])
v3 = pd.Series([1,2,np.nan,4])

# None could cause problems, although pandas has already coerced it to NaN here
print(v1 * v2)
# np.nan is safe for vectorized operations
print(v2 * v3)


0     1.0
1     4.0
2     NaN
3    16.0
dtype: float64
0     1.0
1     4.0
2     NaN
3    16.0
dtype: float64
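
Note that pd.Series([1, 2, None, 4]) already coerced None to NaN, which is why both products above work. The difference really shows with plain NumPy arrays, where None forces an object dtype; a sketch:

import numpy as np

a = np.array([1, 2, np.nan, 4])  # dtype float64: NaN simply propagates
print(a * 2)                     # [ 2.  4. nan  8.]

b = np.array([1, 2, None, 4])    # dtype object: no vectorized arithmetic
try:
    b * 2                        # None * 2 raises a TypeError
except TypeError as err:
    print("TypeError:", err)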

Exercises

Exercise 1: Load iqsize.csv using the csv library. The result should be a list of lists.

Exercise 2: Do you think there is any advantage to using dictionaries?

Exercise 3: Load iqsize.csv taking advantage of indexes and dictionaries. Describe the problems that you have faced so far.

Exercise 4: Identify the dataset variables. Are they quantitative or qualitative? Can you identify the units? If you have enough time, convert the units to the metric system.

Exercise 5: Check the range of the quantitative variables. Are they correct? If not, how would you correct them (don't spend too much time)? (If you get an error, catch the exception and pass it.)

Exercise 6: Check the labels of the qualitative variables. Are they correct? If not, how would you correct them?

Exercise 7: For quantitative variables, compute the mean and median.

Exercise 8: For qualitative variables, count how many observations of each label exist.

Exercise 9: Compute the Exercise 7 statistics, but now for each label of the Sex variable.

Bibliography

[1] E. de Jonge and M. van der Loo, An Introduction to Data Cleaning with R - https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf

[2] Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. Wiley-IEEE.

[3] Hadley Wickham. Tidy Data - http://vita.had.co.nz/papers/tidy-data.pdf

[4] http://stattrek.com/descriptive-statistics/variables.aspx