CSE 6040, Fall 2015 [08]: Data analysis and visualization

In today's class, we will first introduce a data analysis tool called Pandas, and then show how to visualize data using a module called Seaborn.

Most of the examples come from the Pandas tutorial and the Seaborn tutorial.

Part 1: Data analysis using Pandas

Pandas is pre-installed with Anaconda. Let's try to import it.


In [12]:
import pandas as pd

Create Data

The data set will consist of 5 baby names and the number of births recorded for that year (1880).


In [2]:
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the zip function.


In [3]:
# list() makes this work on Python 3 too, where zip returns a lazy iterator
BabyDataSet = list(zip(names, births))
BabyDataSet


Out[3]:
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
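In Python 3, zip returns a lazy iterator rather than a list, so to see the pairs as in Out[3] it has to be materialized with list(). The following is a small sketch of that behavior; the variable names baby_data_set and leftover are ours, chosen only for illustration.

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

pairs = zip(names, births)     # Python 3: an iterator, nothing is computed yet
baby_data_set = list(pairs)    # consuming the iterator yields the list of tuples
leftover = list(pairs)         # a second pass finds the iterator already exhausted

print(baby_data_set)
print(leftover)
```

Because a zip iterator can only be consumed once, it is usually safest to convert it to a list before passing it around.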

We are basically done creating the data set. We will now use the pandas library to export it to a CSV file.

We will create a DataFrame object. You can think of this object as holding the contents of BabyDataSet in a format similar to an Excel spreadsheet.


In [4]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df


Out[4]:
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 John 578
4 Mel 973
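A DataFrame can also be built directly from the two lists, without zipping them first. This is only a sketch of an alternative; the name df_from_dict is ours, and the columns argument is passed explicitly so that the column order does not depend on the pandas version.

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# Each dictionary key becomes a column name, each list becomes that column's values
df_from_dict = pd.DataFrame({'Names': names, 'Births': births},
                            columns=['Names', 'Births'])

print(df_from_dict.shape)    # (rows, columns)
print(df_from_dict.dtypes)   # one dtype per column
```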

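Export the DataFrame to a CSV file. We can name the file births1880.csv. The function to_csv will be used to export the file. Passing index=False and header=False leaves out the row labels and the column names, so only the data is written. The file will be saved in the same location as the notebook unless specified otherwise.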


In [5]:
df.to_csv('births1880.csv',index=False,header=False)
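To see what index=False and header=False do without opening the file, we can call to_csv with no file name, in which case it returns the CSV text as a string. A minimal sketch on a two-row frame (df_small is a made-up name):

```python
import pandas as pd

df_small = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'Births': [968, 155]},
                        columns=['Names', 'Births'])

# With no path given, to_csv returns the text it would have written
with_header = df_small.to_csv(index=False)
without_header = df_small.to_csv(index=False, header=False)

print(with_header)
print(without_header)
```

The second string has no column-name line, which is exactly why reading the file back needs extra care with headers.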

Get Data

To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function and what inputs it takes.


In [6]:
df = pd.read_csv("births1880.csv")
df


Out[6]:
Bob 968
0 Jessica 155
1 Mary 77
2 John 578
3 Mel 973

This brings us to the first problem of the exercise. The read_csv function treated the first record in the CSV file as the header names. This is not correct, since the file we wrote does not contain a header row.

To correct this we will pass the header parameter to the read_csv function and set it to None (Python's null value).


In [7]:
df = pd.read_csv("births1880.csv", header=None)
df


Out[7]:
0 1
0 Bob 968
1 Jessica 155
2 Mary 77
3 John 578
4 Mel 973

If we want to give the columns specific names, we pass another parameter called names. When names is given, we can omit the header parameter, because read_csv then assumes the file has no header row.


In [8]:
df = pd.read_csv("births1880.csv", names=['Names','Births'])
df


Out[8]:
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 John 578
4 Mel 973
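The three ways of calling read_csv shown above can be compared side by side without touching the disk. The sketch below feeds the same text through io.StringIO; the variable names are ours, and the text is what we assume births1880.csv contains.

```python
import io
import pandas as pd

csv_text = "Bob,968\nJessica,155\nMary,77\nJohn,578\nMel,973\n"

# Default: the first line is consumed as the header, so one record is lost
default_read = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is data, columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=...: every line is data, columns get the names we supply
named = pd.read_csv(io.StringIO(csv_text), names=['Names', 'Births'])

print(len(default_read), list(default_read.columns))
print(len(no_header), list(no_header.columns))
print(len(named), list(named.columns))
```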

It is also possible to read in a CSV file by passing a URL. Here we use the famous Iris dataset.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.


In [10]:
df = pd.read_csv("https://raw.githubusercontent.com/bigmlcom/bigmler/master/data/iris.csv")
df.head(10)


Out[10]:
sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa

Analyze Data


In [11]:
# show basic statistics
df.describe()


Out[11]:
sepal length sepal width petal length petal width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
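To make the rows of describe() concrete, here is a sketch on the five birth counts from Part 1, small enough to check by hand. Note that the std row is the sample standard deviation (ddof=1). The names births_series and stats are ours.

```python
import pandas as pd

births_series = pd.Series([968, 155, 77, 578, 973], name='Births')

stats = births_series.describe()
print(stats)

# describe() reports the sample standard deviation, i.e. it divides by n - 1
sample_std = births_series.std(ddof=1)
print(sample_std)
```

The 50% row is the median, so for these values it is the middle element after sorting.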

In [12]:
# Select a column
df["sepal length"].head()


Out[12]:
0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal length, dtype: float64

In [13]:
# select columns
df[["sepal length", "petal width"]].head()


Out[13]:
sepal length petal width
0 5.1 0.2
1 4.9 0.2
2 4.7 0.2
3 4.6 0.2
4 5.0 0.2

In [14]:
# select rows by index label (both endpoints are included)
df.loc[5:10]


Out[14]:
sepal length sepal width petal length petal width species
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa

In [15]:
# select rows by integer position (the end position is excluded)
df.iloc[5:10]


Out[15]:
sepal length sepal width petal length petal width species
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
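With the default index 0, 1, 2, ... labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses a made-up frame (df_demo) whose labels start at 100, so the two selections clearly diverge.

```python
import pandas as pd

df_demo = pd.DataFrame({"value": [10, 20, 30, 40, 50]},
                       index=[100, 101, 102, 103, 104])

# loc slices by label and includes both endpoints
by_label = df_demo.loc[101:103]

# iloc slices by position like a Python list: start included, stop excluded
by_position = df_demo.iloc[1:3]

print(by_label)
print(by_position)
```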

In [16]:
# select rows by condition
df[df["sepal length"] > 5.0]


Out[16]:
sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
21 5.1 3.7 1.5 0.4 Iris-setosa
23 5.1 3.3 1.7 0.5 Iris-setosa
27 5.2 3.5 1.5 0.2 Iris-setosa
28 5.2 3.4 1.4 0.2 Iris-setosa
31 5.4 3.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
36 5.5 3.5 1.3 0.2 Iris-setosa
39 5.1 3.4 1.5 0.2 Iris-setosa
44 5.1 3.8 1.9 0.4 Iris-setosa
46 5.1 3.8 1.6 0.2 Iris-setosa
48 5.3 3.7 1.5 0.2 Iris-setosa
50 7.0 3.2 4.7 1.4 Iris-versicolor
51 6.4 3.2 4.5 1.5 Iris-versicolor
52 6.9 3.1 4.9 1.5 Iris-versicolor
53 5.5 2.3 4.0 1.3 Iris-versicolor
54 6.5 2.8 4.6 1.5 Iris-versicolor
55 5.7 2.8 4.5 1.3 Iris-versicolor
56 6.3 3.3 4.7 1.6 Iris-versicolor
58 6.6 2.9 4.6 1.3 Iris-versicolor
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
127 6.1 3.0 4.9 1.8 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
129 7.2 3.0 5.8 1.6 Iris-virginica
130 7.4 2.8 6.1 1.9 Iris-virginica
131 7.9 3.8 6.4 2.0 Iris-virginica
132 6.4 2.8 5.6 2.2 Iris-virginica
133 6.3 2.8 5.1 1.5 Iris-virginica
134 6.1 2.6 5.6 1.4 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
137 6.4 3.1 5.5 1.8 Iris-virginica
138 6.0 3.0 4.8 1.8 Iris-virginica
139 6.9 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 5.6 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

118 rows × 5 columns
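Selecting by condition works in two steps: the comparison produces a boolean Series (a mask), and indexing with that mask keeps the rows where it is True. The sketch below shows the mask explicitly on a small hand-typed frame (flowers is a made-up name) and combines two conditions with &; each condition needs its own parentheses.

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 6.4, 5.0],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-setosa"],
})

mask = flowers["sepal length"] > 5.0   # boolean Series, one entry per row
long_sepals = flowers[mask]            # keep only the rows where mask is True

# Combine conditions element-wise with & (and) or | (or)
long_setosa = flowers[(flowers["sepal length"] > 5.0) &
                      (flowers["species"] == "Iris-setosa")]

print(mask)
print(long_sepals)
print(long_setosa)
```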

We can get the maximum sepal length with max():


In [17]:
df["sepal length"].max()


Out[17]:
7.9

To see the full record of the flower with the maximum sepal length, we can sort in descending order and take the first row:


In [18]:
# DataFrame.sort was removed from pandas; sort_values is its replacement
df.sort_values("sepal length", ascending=False).head(1)


Out[18]:
sepal length sepal width petal length petal width species
131 7.9 3.8 6.4 2 Iris-virginica
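Sorting the whole frame just to find one row is more work than needed. As an alternative sketch, idxmax() returns the index label of the maximum, which loc can then look up. The three-row frame below is hand-typed for illustration and keeps the original labels.

```python
import pandas as pd

flowers = pd.DataFrame(
    {"sepal length": [5.1, 7.9, 6.3],
     "species": ["Iris-setosa", "Iris-virginica", "Iris-virginica"]},
    index=[0, 131, 146],
)

label = flowers["sepal length"].idxmax()  # index label of the largest value
row = flowers.loc[label]                  # full record for that label

print(label)
print(row)
```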

Exercise

Print the full information of the flower whose petal length is the second shortest among the 50 Iris-setosa flowers.


In [25]:
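# keep only the Iris-setosa rows, then sort them by petal length
setosa = df[df["species"] == "Iris-setosa"]
# sorting returns a new DataFrame, so select from the sorted result directly
setosa.sort_values("petal length", ascending=True).iloc[1]


Out[25]:
sepal length            4.3
sepal width             3.0
petal length            1.1
petal width             0.1
species         Iris-setosa
Name: 13, dtype: object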
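Two details matter in exercises like this one: sorting returns a new DataFrame instead of changing the original, and iloc must be applied to the sorted result. The sketch below shows both on a made-up four-row frame (petals), and also uses nsmallest as an alternative way to reach the same row.

```python
import pandas as pd

petals = pd.DataFrame({
    "petal length": [1.4, 1.1, 1.0, 4.7],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

setosa = petals[petals["species"] == "Iris-setosa"]

# sort_values returns a sorted copy; setosa itself keeps its original order
second_shortest = setosa.sort_values("petal length").iloc[1]

# Alternative: take the two smallest rows and keep the last of them
same_row = setosa.nsmallest(2, "petal length").iloc[-1]

print(second_shortest)
print(list(setosa.index))
```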

Pandas also has some basic plotting functions.


In [39]:
import matplotlib.pyplot as plt
%matplotlib inline
df.hist()


Out[39]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10f89b810>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10fbc3410>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x10fc4a050>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10fcacbd0>]], dtype=object)
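Each panel drawn by hist() is a set of bin counts for one numeric column, using 10 equal-width bins by default. The following sketch computes such counts directly with numpy for the ten petal lengths shown in Out[10]; it only illustrates the binning and is not how pandas draws the figure.

```python
import numpy as np

# petal length values of the first ten rows, copied from Out[10]
values = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

# 10 equal-width bins between the minimum and the maximum
counts, edges = np.histogram(values, bins=10)

print(counts)
print(edges)
```

Every value falls into exactly one bin, so the counts add up to the number of rows.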

Part 2: Visualization using Seaborn

Seaborn may not be installed by default with Anaconda.

If the import below fails, try installing it with pip: pip install seaborn.


In [1]:
import seaborn as sns

# show the plots right below the code cells
%matplotlib inline

Plotting univariate distributions

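The distplot() function draws a histogram and fits a kernel density estimate (KDE), a smooth curve that approximates the distribution. In seaborn 0.11 and later, distplot() is deprecated in favor of histplot() and displot().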


In [2]:
import numpy as np
x = np.random.normal(size=100)
sns.distplot(x)


Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x10976df90>

In [10]:
import random
x = [random.normalvariate(0, 1) for _ in range(1000)]
sns.distplot(x)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a669c10>
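To give an idea of what a kernel density estimate is, the sketch below computes one by hand with numpy: a small Gaussian bump is centered on every sample, and the bumps are averaged. The bandwidth uses a common rule of thumb, which is our assumption; seaborn's own default may differ.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
x = rng.normal(size=100)

# Rule-of-thumb bandwidth (an assumption, not necessarily seaborn's choice)
bandwidth = 1.06 * x.std(ddof=1) * len(x) ** (-1.0 / 5.0)

# Evaluate the estimate on a grid that extends a little past the data
grid = np.linspace(x.min() - 4 * bandwidth, x.max() + 4 * bandwidth, 400)

# One Gaussian bump per sample, averaged over all samples
z = (grid[:, None] - x[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to about 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth gives a wigglier curve and a larger one gives a smoother curve.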

Plotting bivariate distributions


In [16]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
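In the cell above, mean holds the centers of x and y, and cov is the covariance matrix: variance 1 for each variable and covariance 0.5 between them. As a sanity check, the sketch below draws a larger sample (5000 points and a fixed seed, both our choices) and confirms that the sample statistics land close to those values.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
mean, cov = [0, 1], [(1, .5), (.5, 1)]

data = rng.multivariate_normal(mean, cov, 5000)
df_check = pd.DataFrame(data, columns=["x", "y"])

sample_mean = df_check.mean()   # should be close to [0, 1]
sample_cov = df_check.cov()     # should be close to the cov matrix above

print(sample_mean)
print(sample_cov)
```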

Scatter plot


In [17]:
sns.jointplot(x="x", y="y", data=df)


Out[17]:
<seaborn.axisgrid.JointGrid at 0x10a849a50>

Hexbin plot


In [18]:
sns.jointplot(x="x", y="y", data=df, kind="hex")


Out[18]:
<seaborn.axisgrid.JointGrid at 0x10b50c710>

Kernel density estimation


In [19]:
sns.jointplot(x="x", y="y", data=df, kind="kde")


Out[19]:
<seaborn.axisgrid.JointGrid at 0x10b75c8d0>

Visualizing pairwise relationships in a dataset

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:


In [21]:
iris = sns.load_dataset("iris")
sns.pairplot(iris)


Out[21]:
<seaborn.axisgrid.PairGrid at 0x10b18de50>

In [22]:
# we can add colors to different species
sns.pairplot(iris, hue="species")


Out[22]:
<seaborn.axisgrid.PairGrid at 0x110a5cfd0>

Visualizing linear relationships


In [23]:
tips = sns.load_dataset("tips")
tips.head()


Out[23]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

We can use the function regplot to show the linear relationship between total_bill and tip. It also shows the 95% confidence interval.


In [24]:
sns.regplot(x="total_bill", y="tip", data=tips)


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x111e37d10>
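The straight line in such a plot is a least-squares fit. As a rough sketch of that idea, numpy's polyfit with degree 1 returns the slope and intercept of the best-fitting line. The numbers below are made up (a 15% tip plus one dollar) so that the answer is known in advance; they are not the tips data.

```python
import numpy as np

# Made-up bills and tips that follow tip = 0.15 * bill + 1.0 exactly
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = 0.15 * total_bill + 1.0

# Degree-1 least-squares fit: returns [slope, intercept]
slope, intercept = np.polyfit(total_bill, tip, deg=1)

print(slope, intercept)
```

With real, noisy data the fitted slope is only an estimate, which is what the shaded confidence band in the plot conveys.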

Visualizing higher order relationships


In [25]:
anscombe = sns.load_dataset("anscombe")
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"])


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x112817a90>

The plot clearly shows that this is not a good model. Let's try to fit a polynomial regression model with degree 2.


In [164]:
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"], order=2)


Out[164]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b683190>
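One way to quantify "not a good model" is to compare the sum of squared residuals of a degree-1 and a degree-2 fit. The sketch below does this with numpy's polyfit on made-up points that lie exactly on a parabola; the coefficients are our own choice and are not taken from the anscombe data.

```python
import numpy as np

# Made-up points on a parabola, loosely resembling a curved pattern
x = np.arange(4, 15, dtype=float)
y = -0.13 * x ** 2 + 2.8 * x - 6.0

def sum_squared_residuals(degree):
    """Fit a polynomial of the given degree and return its squared error."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

sse_linear = sum_squared_residuals(1)
sse_quadratic = sum_squared_residuals(2)

print(sse_linear, sse_quadratic)
```

The quadratic fit leaves essentially no residual, while the line cannot follow the curvature.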

Strip plots

A strip plot is similar to a scatter plot, but it is used when one of the variables is categorical.


In [168]:
sns.stripplot(x="day", y="total_bill", data=tips)


Out[168]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c47b3d0>

Boxplots


In [169]:
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)


Out[169]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b62d690>

Bar plots


In [170]:
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)


Out[170]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c2d5f90>