The aim of this notebook is to introduce you to basic data cleaning using Python and Pandas. Most of the contents follow the ideas presented in the great report by de Jonge and van der Loo - [Introduction to data cleaning with R][1].
As explained in [1], most of a Data Scientist's work is spent cleaning and preparing data before any statistical analysis or model application. It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data [2]. Of course one can find data sources that are ready to go, but usually these are already-explored data sets used in practical examples. The reality, however, is that data is full of errors, lacks format consistency and is potentially incomplete. The Data Scientist's mission is to convert these raw data sources into consistent data sets that can be used as input for further analysis.
Even with a technically correct data set, and after hard work on cleaning, checking and filling, the data can still lack a standard way of organizing values within the dataset. Hadley Wickham defined this standardization as Tidy Data [3].
De Jonge and van der Loo define statistical data analysis in five steps:
1. Raw data
       | type checking, normalizing
       v
2. Technically correct data
       | fix and impute
       v
3. Consistent data
       | estimate, analyze, derive, etc.
       v
4. Statistical results
       | tabulate, plot
       v
5. Formatted output
In the previous graph you find a numbered list with five items. Each item represents data in a different state, and the arrows represent the actions needed at each step to move to the next one. Note that as data is transformed from the first to the fifth state, it gains value incrementally at each step.
At the first stage, we have data as is. Raw data is a rough diamond that is going to be cut and polished at each step. Among the errors we can find are wrong types, inconsistent variable encodings, data without labels, etc.
Technically correct data is data that can be loaded into Pandas structures; let's say that it has the proper "shape", with correct names, types, labels and so on. However, variables may still be out of range or inconsistent with one another (relations between variables).
In the consistent data stage, data is ready for statistical inference. For example, the total income in a year must be the sum of the twelve monthly incomes.
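As an illustration, a consistency rule like that can be checked directly in Pandas. Below is a minimal sketch with hypothetical column names (Jan, Feb, Total), not taken from the notebook's data:

import pandas as pd

# hypothetical income table: one row per person, two months plus a total
incomes = pd.DataFrame({'Jan': [100, 200], 'Feb': [110, 190],
                        'Total': [210, 400]})

# consistency rule: the total must equal the sum of the monthly columns
violations = incomes[incomes[['Jan', 'Feb']].sum(axis=1) != incomes['Total']]
print(violations)  # rows that break the rule (here, the second one)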
The later stages contain the statistical results derived from the analysis, which can ultimately be formatted to provide a synthetic layout.
**Best Practice:** It is a good idea to keep the input of each step in a local file, and to keep the methods applied ready to be reproduced at each stage (at least).
Why? We will see that at each stage we can potentially lose or modify the initial data. This loss or modification of data can influence the final analysis. All operations performed over a data set should be reproducible.
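In practice, this can be as simple as writing the data frame to disk at the end of each stage. A minimal sketch, with hypothetical file names:

import pandas as pd

# hypothetical file names; adapt them to your own pipeline
raw = pd.read_csv('../data/people_raw.csv')

# stage 1 -> 2: type checking, normalizing (the real fixes would go here)
technically_correct = raw.copy()
technically_correct.to_csv('../data/people_technically_correct.csv')

# stage 2 -> 3: fix and impute (the real fixes would go here)
consistent = technically_correct.copy()
consistent.to_csv('../data/people_consistent.csv')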
Python offers a good interactive environment that facilitates transforming and computing on datasets, while providing a nice scripting framework for reproducing procedures.
Data cleaning can be seen as the first step of statistical analysis, and as programmers we tend to forget or mix up statistical terms. What does a statistician mean by variable? For a computer programmer, a variable is a memory space that can be filled with a known (or unknown) quantity of information (a.k.a. a value). Moreover, this space has an associated name (alias) that can be used in a program at run time to modify the value of the variable. Well, don't take this as an exact definition, but it helps to give us a general refresher of what a variable is (for us computer scientists).
Well, statisticians have their own variables; let's have an (again) informal definition. In statistics, a variable is an attribute that describes a person, place, thing, or idea (often referred to as a feature).
As an example, we can take the list of physical characteristics of 10 persons. The objects (rows) of the matrix are the persons; the variables are the measured properties, such as the weight or the color of the eyes.
In [2]:
import pandas as pd

# index_col=0 uses the first CSV column as the row labels
df = pd.read_csv('../data/people.csv', index_col=0)
df
Out[2]:
Variables can be classified as qualitative (aka, categorical) or quantitative (aka, numeric).
Qualitative: Qualitative variables take on values that are names or labels. The eye color (e.g., brown, green, gray, etc.) or the sex of the person (e.g., female, male) would be examples of qualitative or categorical variables.
Quantitative: Quantitative variables are numeric. They represent a measurable quantity. For example, when we speak of the age of a person, we are talking about the time elapsed since their birth - a measurable attribute of the person. Therefore, age would be a quantitative variable.
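A quick way to see this split in the frame loaded above is to ask Pandas for the numeric and non-numeric columns (a sketch; the exact column names depend on the CSV):

# quantitative (numeric) vs. qualitative (object) columns
print(df.select_dtypes(include='number').columns)
print(df.select_dtypes(include='object').columns)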
In Pandas there are basically two fundamental data structures: Series and DataFrame. According to the previous definition of statistical variables, a Series would be a variable and a DataFrame would be a set of variables. Moreover, if we slice a DataFrame (a set of variables) we get a Series (a variable).
In [3]:
print(type(df["Sex"]))
And what kind of variable is, for example, the age of a person?
In [4]:
print(df["Age[years]"].dtype)
And what about categorical variables? Well, that deserves more than a word; commonly, for us, categorical variables will have object dtype (see the note after the next cell).
In [5]:
print(df["Sex"].dtype)
print (df["Eye Color"].dtype)
To know more about this, give a read to the references below and the Pandas documentation on dtypes.
There are some values that are considered special. This is the case when we have missing data (when no value is stored for a variable in an observation). There are other cases too, such as Inf and -Inf.
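A small sketch of how these special values behave in practice (pd.isnull flags only missing values, not infinities):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.inf, -np.inf])
print(pd.isnull(s))  # True only for NaN
print(np.isinf(s))   # True for Inf and -Inf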
Let's suppose that we are collecting attributes of individuals, and the height of one individual is lost or cannot be measured. We always try to keep as much of the data as we can, so the best way to deal with this kind of situation is to mark these values as missing.
In [6]:
import pandas as pd

# values matching "NaN" in the file are parsed as missing
df = pd.read_csv('../data/people_with_nas.csv', index_col=0, na_values="NaN")
print(df)
print(df["Weight[kg]"])
pd.isnull(df["Weight[kg]"])  # boolean mask: True where the value is missing
Out[6]:
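Once the missing values are marked, Pandas offers vectorized ways to deal with them. Two common options, as a sketch (imputing with the mean is just one simple choice):

print(df["Weight[kg]"].dropna())                         # drop missing rows
print(df["Weight[kg]"].fillna(df["Weight[kg]"].mean()))  # impute with the mean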
You should be asking yourself... why use np.nan and not None?
The answer is (or at least should be) that np.nan allows vectorized operations: it is a floating-point value, so columns containing it stay numeric, while None is a plain Python object.
In [7]:
import numpy as np
v1 = pd.Series([1,2,3,4])
v2 = pd.Series([1,2,None,4])
v3 = pd.Series([1,2,np.nan,4])
# pandas converts None to NaN in a numeric Series, so this works here
print(v1 * v2)
# and so does this
print(v2 * v3)
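To see where None actually bites, force a Series to keep it as a plain Python object: the numeric dtype is lost and element-wise arithmetic can fail (a minimal sketch):

import numpy as np
import pandas as pd

s_nan = pd.Series([1, 2, np.nan, 4])  # float64: NaN simply propagates
print(s_nan * 2)

s_obj = pd.Series([1, 2, None, 4], dtype=object)  # object dtype keeps None
try:
    print(s_obj * 2)  # None * 2 raises a TypeError
except TypeError as err:
    print("object dtype broke vectorization:", err)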
Exercise 1: Load iqsize.csv using the csv library. The result should be a list of lists.
Exercise 2: Do you think there is any advantage to using dictionaries?
Exercise 3: Load iqsize.csv taking advantage of indexes and dictionaries. Describe the problems that you have faced so far.
Exercise 4: Identify the dataset variables. Are they quantitative or qualitative? Can you identify the units? If you have enough time, convert the units to the metric system.
Exercise 5: Check the range of the quantitative variables. Are they correct? If not, how would you correct them (don't spend too much time)? (If you get an error, catch the exception and pass.)
Exercise 6: Check the labels of the qualitative variables. Are they correct? If not, how would you correct them?
Exercise 7: For quantitative variables, compute the mean and median.
Exercise 8: For qualitative variables, count how many observations of each label exist.
Exercise 9: Compute the Exercise 7 statistics, but now for each label in the Sex variable.
[1] E. de Jonge, M. van der Loo. Introduction to data cleaning with R. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
[2] T. Dasu, T. Johnson (2003). Exploratory Data Mining and Data Cleaning. Wiley-IEEE.
[3] Hadley Wickham. Tidy Data. http://vita.had.co.nz/papers/tidy-data.pdf
[4] Stat Trek. Variables in Statistics. http://stattrek.com/descriptive-statistics/variables.aspx