Author: Andrew Andrade (andrew@andrewandrade.ca)
This is a complementary tutorial for datascienceguide.github.io outlining the basics of exploratory data analysis.
In this tutorial, we will learn to open a comma-separated value (CSV) data file, compute summary statistics, and make basic visualizations of the variables in the Anscombe dataset (to see the importance of visualization). Next we will investigate Fisher's Iris data set using more powerful visualizations.
This tutorial assumes a basic understanding of Python, so for those new to Python, learning the basic syntax first will be very helpful. I recommend writing Python code in a Jupyter notebook, as it allows you to rapidly prototype and annotate your code.
Python is a very easy language to get started with, and there are many guides. Full list: http://docs.python-guide.org/en/latest/intro/learning/
My favourite resources:
https://docs.python.org/2/tutorial/introduction.html
https://docs.python.org/2/tutorial/
http://learnpythonthehardway.org/book/
https://www.udacity.com/wiki/cs101/%3A-python-reference
http://rosettacode.org/wiki/Category:Python
Once you are familiar with python, the first part of this guide is useful in learning some of the libraries we will be using: http://cs231n.github.io/python-numpy-tutorial
In addition, the following post helps teach the basics for data analysis in python:
http://www.analyticsvidhya.com/blog/2014/07/baby-steps-libraries-data-structure/
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
We should store the data in a known location on our local computer or server. The simplest way is to download each file and save it in the same folder you launch Jupyter notebook from, but I prefer to save my datasets in a datasets folder one directory up from my tutorial code (../datasets/).
You should download the following CSVs:
http://datascienceguide.github.io/datasets/anscombe_i.csv
http://datascienceguide.github.io/datasets/anscombe_ii.csv
http://datascienceguide.github.io/datasets/anscombe_iii.csv
http://datascienceguide.github.io/datasets/anscombe_iv.csv
http://datascienceguide.github.io/datasets/iris.csv
If using a server, you can download the file by using the following command:
wget http://datascienceguide.github.io/datasets/iris.csv
Now we can run the following code to open the CSV files.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
anscombe_i = pd.read_csv('../datasets/anscombe_i.csv')
anscombe_ii = pd.read_csv('../datasets/anscombe_ii.csv')
anscombe_iii = pd.read_csv('../datasets/anscombe_iii.csv')
anscombe_iv = pd.read_csv('../datasets/anscombe_iv.csv')
The first three lines of code import the libraries we are using and bind them to shorter names.
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. We will use it for basic graphics.
Numpy is the fundamental package for scientific computing with Python. Among other things, it provides a powerful N-dimensional array object and tools for fast numerical computation.
Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It extends the numpy array to allow for columns of different variable types.
Since we are using Jupyter notebook, we use the line %matplotlib inline
to tell matplotlib to put the figures inline with the notebook (instead of in a popup window).
pd.read_csv opens a .csv file and stores it in a DataFrame object, which we call anscombe_i, anscombe_ii, etc.
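As a tiny illustration of what pandas adds on top of numpy, a DataFrame can hold columns of different types side by side (a sketch with made-up values, not one of our datasets):

```python
import pandas as pd

# A small, made-up DataFrame mixing column types
frame = pd.DataFrame({
    "name": ["a", "b", "c"],    # strings
    "value": [1.5, 2.5, 3.5],   # floats
    "count": [10, 20, 30],      # integers
})

print(frame.dtypes)           # each column keeps its own dtype
print(frame["value"].mean())  # numeric columns still support numpy-style math
```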
Next, let us see the structure of the data by printing the first 5 rows (using [0:5]) of the first data set:
In [2]:
print(anscombe_i[0:5])
Now let us use the describe function to see summary statistics; slicing its output with [:3] keeps the first three rows (count, mean, and std):
In [3]:
print("Data Set I")
print(anscombe_i.describe()[:3])
print("Data Set II")
print(anscombe_ii.describe()[:3])
print("Data Set III")
print(anscombe_iii.describe()[:3])
print("Data Set IV")
print(anscombe_iv.describe()[:3])
Looking only at the mean and the standard deviation, the datasets appear almost identical. Instead, let us make a scatter plot for each of the data sets.
Since the data is stored in a data frame (similar to an Excel sheet), we can see the column names on top, and we can access the columns using the following syntax:
anscombe_i.x
anscombe_i.y
or
anscombe_i['x']
anscombe_i['y']
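Both styles return the same column as a pandas Series; a quick check on a toy frame (made-up values):

```python
import pandas as pd

# Toy frame with made-up values
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Attribute access and bracket access give the same Series
print(df.x.equals(df["x"]))  # True
```

One caveat worth knowing: bracket access works for any column name, while attribute access only works when the name is a valid Python identifier — so for columns with spaces (like "sepal length" later in this tutorial) you must use brackets.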
In [4]:
plt.figure(1)
plt.scatter(anscombe_i.x, anscombe_i.y, color='black')
plt.title("anscombe_i")
plt.xlabel("x")
plt.ylabel("y")
plt.figure(2)
plt.scatter(anscombe_ii.x, anscombe_ii.y, color='black')
plt.title("anscombe_ii")
plt.xlabel("x")
plt.ylabel("y")
plt.figure(3)
plt.scatter(anscombe_iii.x, anscombe_iii.y, color='black')
plt.title("anscombe_iii")
plt.xlabel("x")
plt.ylabel("y")
plt.figure(4)
plt.scatter(anscombe_iv.x, anscombe_iv.y, color='black')
plt.title("anscombe_iv")
plt.xlabel("x")
plt.ylabel("y")
Out[4]:
Shockingly, we can clearly see that the datasets are quite different! The first data set has pure irreducible error, the second data set is not linear, the third dataset has an outlier, and in the fourth dataset all of the x values are the same except for one outlier. If you do not believe me, I uploaded an excel worksheet with the full datasets and summary statistics here
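You can also verify this numerically. Here is a sketch using Anscombe's published values for datasets I and II, hard-coded below so it runs without the CSVs: the means, standard deviations, correlations, and even the least-squares fits agree to two decimal places despite the very different shapes.

```python
import numpy as np

# Anscombe's published values: datasets I and II share the same x column
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    r = np.corrcoef(x, y)[0, 1]             # x-y correlation
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line
    # both rows print roughly: 7.5  2.03  0.82  0.5  3.0
    print(name, round(y.mean(), 2), round(y.std(ddof=1), 2),
          round(r, 2), round(slope, 2), round(intercept, 2))
```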
Now let us learn how to make a box plot. Before writing this tutorial I didn't know how to make a box plot in matplotlib (I usually use seaborn, which we will learn soon). I did a quick google search for "box plot matplotlib" and found an example here which outlines a couple of styling options.
In [5]:
# basic box plot
plt.figure(1)
plt.boxplot(anscombe_i.y)
plt.title("anscombe_i y box plot")
Out[5]:
Try reading the documentation for the box plot above and making your own visualizations.
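As one starting point, here is a hedged sketch using a couple of styling options from the matplotlib boxplot documentation (notch for notched boxes, patch_artist so the boxes can be filled); the data is randomly generated rather than the Anscombe values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
# three made-up samples with increasing spread
samples = [rng.normal(0, spread, 100) for spread in (1, 2, 3)]

fig, ax = plt.subplots()
# notch=True draws notched boxes; patch_artist=True returns fillable patches
bp = ax.boxplot(samples, notch=True, patch_artist=True)
for box in bp["boxes"]:
    box.set_facecolor("lightblue")
ax.set_xticklabels(["narrow", "medium", "wide"])
ax.set_title("styled box plot")
```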
Next we are going to learn how to use Seaborn, which is a very powerful visualization library. Matplotlib is a great library and has many examples of different plots, but seaborn is built on top of matplotlib and offers better plots for statistical analysis. If you do not have seaborn installed, you can follow the instructions here: http://stanford.edu/~mwaskom/software/seaborn/installing.html#installing . Seaborn also has many examples and a tutorial.
To show the power of the library, we are going to plot the Anscombe datasets in 1 plot following this example: http://stanford.edu/~mwaskom/software/seaborn/examples/anscombes_quartet.html . Do not worry too much about what the code does (it loads the same dataset and changes settings to make the visualization clearer); we will get more experience with seaborn soon.
In [6]:
import seaborn as sns
sns.set(style="ticks")
# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")
# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})
Out[6]:
Seaborn performs linear regression automatically (which we will learn about soon). We can also see that the fitted regression line is the same for each dataset even though the datasets are quite different.
The big takeaway here is that summary statistics can be deceptive! Always make visualizations of your data before making any models.
In [7]:
iris = pd.read_csv('../datasets/iris.csv')
print(iris[0:5])
print(iris.describe())
As we can see, it is difficult to interpret the results. Sepal length, sepal width, petal length, and petal width are all numeric features, and the iris variable is the specific type of iris (a categorical variable). To better understand the data, we can split the data based on each type of iris, make a histogram for each numeric feature, make scatter plots between features, and so on. I will demonstrate the process by generating a histogram of sepal length for Iris-setosa and a scatter plot of sepal length vs width for Iris-setosa.
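A quick aside before we split the data by hand: pandas can produce a per-class summary in one line with groupby. A minimal sketch on a tiny made-up stand-in for the iris frame (the same call works on the real one):

```python
import pandas as pd

# Tiny made-up stand-in with the same layout as the iris frame
iris_toy = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "iris": ["Iris-setosa", "Iris-setosa",
             "Iris-versicolor", "Iris-versicolor",
             "Iris-virginica", "Iris-virginica"],
})

# One row of feature means per iris type
print(iris_toy.groupby("iris").mean())
```

On the real dataset, iris.groupby("iris").describe() similarly gives the full describe() table for each class at once.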
In [8]:
#select all Iris-setosa
iris_setosa = iris[iris.iris == "Iris-setosa"]
plt.figure(1)
#make histogram of sepal length
plt.hist(iris_setosa["sepal length"])
plt.xlabel("sepal length")
plt.figure(2)
plt.scatter(iris_setosa["sepal width"], iris_setosa["sepal length"] )
plt.xlabel("sepal width")
plt.ylabel("sepal length")
Out[8]:
This helps us to better understand the data and is necessary for good analysis, but doing this for all the features and iris types (classes) would take a significant amount of time. Seaborn has a function called pairplot which will do all of that for us!
In [9]:
sns.pairplot(iris, hue="iris")
Out[9]:
We now have a much better understanding of the data. For example, we can see linear correlations between some of the numeric features. We can also see which numeric features separate the types of iris well and which do not.
Exploratory data analysis is not done! We could spend a whole course on exploratory data analysis (I took one when I was on exchange at the National University of Singapore). For this reason, EDA will be a recurring theme in these tutorials, and we will continue to visualize data throughout. Data comes in many different forms; it is our role as data scientists to quickly and effectively understand it.
In the next tutorial we will use the Anscombe dataset for regression, and in future tutorials we will revisit the iris dataset to do classification.
Exploratory data analysis is always an ongoing process, and as we have learnt in this tutorial, it is a necessary step before we start modeling. The way to get better at plotting data is to get started plotting! Pick an interesting dataset and start exploring!
Here are some datasets to get you started:
http://www.kdnuggets.com/datasets/index.html
https://github.com/caesar0301/awesome-public-datasets
https://github.com/datasciencemasters/data
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://opendata.city-of-waterloo.opendata.arcgis.com/
https://github.com/uWaterloo/Datasets
You can also look online for examples and sample code from others using matplotlib, seaborn, and ggplot2 (for those using R) for inspiration.
Have fun!