CSE 6040, Fall 2015 [08]: Data analysis and visualization

In today's class, we will first introduce a data analysis tool called Pandas, and then show how to visualize data using a module called Seaborn.

Most of the examples come from the Pandas tutorial and the Seaborn tutorial.

Part 1: Data analysis using Pandas

Pandas is pre-installed with Anaconda. Let's try to import it.



In [12]:

    
import pandas as pd

Create Data

The data set will consist of 5 baby names and the number of births recorded for that year (1880).



In [2]:

    
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the zip function.



In [3]:

    
# list() materializes the pairs; under Python 3, zip alone returns a one-shot iterator
BabyDataSet = list(zip(names, births))
BabyDataSet









    Out[3]:





[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
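
As a quick illustration of what zip does with the two lists: it pairs elements position by position. Under Python 3 it returns an iterator rather than a list, so this sketch wraps it in list() before displaying or reusing it. The variable name pairs is only for illustration.

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# zip pairs the i-th name with the i-th birth count.
# list() materializes the pairs so they can be printed and reused.
pairs = list(zip(names, births))

print(pairs[0])    # the first (name, births) tuple
print(len(pairs))  # one tuple per name
```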

We are basically done creating the data set. We will now use the pandas library to export this data set to a csv file.

We will create a DataFrame object. You can think of this object as holding the contents of BabyDataSet in a format similar to an Excel spreadsheet.



In [4]:

    
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
df

Export the DataFrame to a csv file. We can name the file births1880.csv. The to_csv function is used to export the file. The file is saved in the same location as the notebook unless another path is given.



In [5]:

    
df.to_csv('births1880.csv',index=False,header=False)
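
To see what the index=False and header=False arguments change, here is a small sketch. It relies on the fact that to_csv returns the CSV text as a string when no path is given; the two-row frame and the names with_defaults and bare are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'Births': [968, 155]},
                  columns=['Names', 'Births'])

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_defaults = df.to_csv()
bare = df.to_csv(index=False, header=False)

print(with_defaults)  # header row plus a leading index column
print(bare)           # only the data rows
```

Because the exported file has no header row, anything that reads it back has to be told so.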

Get Data

To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function and what inputs it takes.



In [6]:

    
df = pd.read_csv("births1880.csv")
df

This brings us to the first problem of the exercise. The read_csv function treated the first record in the csv file as the header names. This is not correct, since the file did not provide any header names.

To correct this, we pass the header parameter to read_csv and set it to None (Python's null value).



In [7]:

    
df = pd.read_csv("births1880.csv", header=None)
df

If we want to give the columns specific names, we pass another parameter called names. In that case we can also omit the header parameter.



In [8]:

    
df = pd.read_csv("births1880.csv", names=['Names','Births'])
df
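
The three read_csv variants can be compared side by side without touching the file system. This sketch reads from an in-memory string through io.StringIO; the three-row sample text is an assumption standing in for births1880.csv.

```python
import io
import pandas as pd

csv_text = "Bob,968\nJessica,155\nMary,77\n"

# Default: the first record is consumed as column names, so one data row is lost.
default = pd.read_csv(io.StringIO(csv_text))

# header=None: every record is data; the columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: every record is data and the columns get the given names.
named = pd.read_csv(io.StringIO(csv_text), names=['Names', 'Births'])

print(len(default), len(no_header), len(named))
print(list(default.columns), list(no_header.columns), list(named.columns))
```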

It is also possible to read a csv file by passing a URL. Here we use the famous Iris dataset.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.



In [10]:

    
df = pd.read_csv("https://raw.githubusercontent.com/bigmlcom/bigmler/master/data/iris.csv")
df.head(10)









    Out[10]:






  
    
      
   sepal length  sepal width  petal length  petal width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa

Analyze Data



In [11]:

    
# show basic statistics
df.describe()









    Out[11]:






  
    
      
       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
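
To make the rows of describe() concrete, this sketch recomputes a few of them by hand on a hypothetical four-value Series. It assumes pandas defaults: std is the sample standard deviation (ddof=1) and the percentiles are linearly interpolated.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = s.describe()

# Each row of describe() matches a single-purpose Series method.
print(summary['count'], s.count())
print(summary['mean'], s.mean())
print(summary['std'], s.std(ddof=1))    # sample standard deviation
print(summary['50%'], s.quantile(0.5))  # the median
```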



In [12]:

    
# Select a column
df["sepal length"].head()









    Out[12]:





0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal length, dtype: float64



In [13]:

    
# select columns
df[["sepal length", "petal width"]].head()









    Out[13]:






  
    
      
   sepal length  petal width
0           5.1          0.2
1           4.9          0.2
2           4.7          0.2
3           4.6          0.2
4           5.0          0.2
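
The two selections above return different types: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has one entry. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

one_column = df['a']          # a Series
two_columns = df[['a', 'c']]  # a DataFrame with two columns
one_as_frame = df[['a']]      # a DataFrame with a single column

print(type(one_column).__name__)
print(type(two_columns).__name__, list(two_columns.columns))
print(type(one_as_frame).__name__, one_as_frame.shape)
```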



In [14]:

    
# select rows by index label (both end points, 5 and 10, are included)
df.loc[5:10]









    Out[14]:






  
    
      
    sepal length  sepal width  petal length  petal width      species
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa



In [15]:

    
# select rows by integer position (the end position, 10, is excluded)
df.iloc[5:10]









    Out[15]:






  
    
      
   sepal length  sepal width  petal length  petal width      species
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
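
On the Iris frame the default index labels coincide with the row positions, so the only visible difference between loc and iloc is whether the end point is included. The distinction is easier to see on a frame whose labels do not start at 0. The frame below is hypothetical.

```python
import pandas as pd

# Index labels 100..109 sit at positions 0..9.
df = pd.DataFrame({'value': range(10, 20)}, index=range(100, 110))

by_label = df.loc[102:104]   # labels 102, 103 and 104 (end label included)
by_position = df.iloc[2:4]   # positions 2 and 3 (end position excluded)

print(list(by_label.index))
print(list(by_position.index))
```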



In [16]:

    
# select rows by condition
df[df["sepal length"] > 5.0]









    Out[16]:






  
    
      
     sepal length  sepal width  petal length  petal width          species
0             5.1          3.5           1.4          0.2      Iris-setosa
5             5.4          3.9           1.7          0.4      Iris-setosa
10            5.4          3.7           1.5          0.2      Iris-setosa
14            5.8          4.0           1.2          0.2      Iris-setosa
15            5.7          4.4           1.5          0.4      Iris-setosa
16            5.4          3.9           1.3          0.4      Iris-setosa
17            5.1          3.5           1.4          0.3      Iris-setosa
18            5.7          3.8           1.7          0.3      Iris-setosa
19            5.1          3.8           1.5          0.3      Iris-setosa
20            5.4          3.4           1.7          0.2      Iris-setosa
21            5.1          3.7           1.5          0.4      Iris-setosa
23            5.1          3.3           1.7          0.5      Iris-setosa
27            5.2          3.5           1.5          0.2      Iris-setosa
28            5.2          3.4           1.4          0.2      Iris-setosa
31            5.4          3.4           1.5          0.4      Iris-setosa
32            5.2          4.1           1.5          0.1      Iris-setosa
33            5.5          4.2           1.4          0.2      Iris-setosa
36            5.5          3.5           1.3          0.2      Iris-setosa
39            5.1          3.4           1.5          0.2      Iris-setosa
44            5.1          3.8           1.9          0.4      Iris-setosa
46            5.1          3.8           1.6          0.2      Iris-setosa
48            5.3          3.7           1.5          0.2      Iris-setosa
50            7.0          3.2           4.7          1.4  Iris-versicolor
51            6.4          3.2           4.5          1.5  Iris-versicolor
52            6.9          3.1           4.9          1.5  Iris-versicolor
53            5.5          2.3           4.0          1.3  Iris-versicolor
54            6.5          2.8           4.6          1.5  Iris-versicolor
55            5.7          2.8           4.5          1.3  Iris-versicolor
56            6.3          3.3           4.7          1.6  Iris-versicolor
58            6.6          2.9           4.6          1.3  Iris-versicolor
..            ...          ...           ...          ...              ...
120           6.9          3.2           5.7          2.3   Iris-virginica
121           5.6          2.8           4.9          2.0   Iris-virginica
122           7.7          2.8           6.7          2.0   Iris-virginica
123           6.3          2.7           4.9          1.8   Iris-virginica
124           6.7          3.3           5.7          2.1   Iris-virginica
125           7.2          3.2           6.0          1.8   Iris-virginica
126           6.2          2.8           4.8          1.8   Iris-virginica
127           6.1          3.0           4.9          1.8   Iris-virginica
128           6.4          2.8           5.6          2.1   Iris-virginica
129           7.2          3.0           5.8          1.6   Iris-virginica
130           7.4          2.8           6.1          1.9   Iris-virginica
131           7.9          3.8           6.4          2.0   Iris-virginica
132           6.4          2.8           5.6          2.2   Iris-virginica
133           6.3          2.8           5.1          1.5   Iris-virginica
134           6.1          2.6           5.6          1.4   Iris-virginica
135           7.7          3.0           6.1          2.3   Iris-virginica
136           6.3          3.4           5.6          2.4   Iris-virginica
137           6.4          3.1           5.5          1.8   Iris-virginica
138           6.0          3.0           4.8          1.8   Iris-virginica
139           6.9          3.1           5.4          2.1   Iris-virginica
140           6.7          3.1           5.6          2.4   Iris-virginica
141           6.9          3.1           5.1          2.3   Iris-virginica
142           5.8          2.7           5.1          1.9   Iris-virginica
143           6.8          3.2           5.9          2.3   Iris-virginica
144           6.7          3.3           5.7          2.5   Iris-virginica
145           6.7          3.0           5.2          2.3   Iris-virginica
146           6.3          2.5           5.0          1.9   Iris-virginica
147           6.5          3.0           5.2          2.0   Iris-virginica
148           6.2          3.4           5.4          2.3   Iris-virginica
149           5.9          3.0           5.1          1.8   Iris-virginica
    
  

118 rows × 5 columns
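
Selecting by condition works in two steps: the comparison produces a boolean Series (a mask), and indexing with that mask keeps the rows where it is True. Masks can be combined with & and |, with each comparison in its own parentheses. A sketch on a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    'sepal length': [5.1, 4.9, 7.0, 6.3],
    'species': ['Iris-setosa', 'Iris-setosa',
                'Iris-versicolor', 'Iris-virginica'],
})

# Step 1: the comparison yields one True/False value per row.
mask = df['sepal length'] > 5.0

# Step 2: indexing with the mask keeps only the True rows.
long_sepal = df[mask]

# Two conditions combined with & (each side needs its own parentheses).
long_setosa = df[(df['sepal length'] > 5.0) & (df['species'] == 'Iris-setosa')]

print(list(mask))
print(list(long_sepal.index), list(long_setosa.index))
```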

We can get the maximum sepal length with max():



In [17]:

    
df["sepal length"].max()









    Out[17]:





7.9

To find the full record of the flower with the maximum sepal length, we can sort by that column in descending order and take the first row:



In [18]:

    
# DataFrame.sort was removed from pandas; sort_values is its replacement
df.sort_values("sepal length", ascending=False).head(1)









    Out[18]:






  
    
      
     sepal length  sepal width  petal length  petal width         species
131           7.9          3.8           6.4          2.0  Iris-virginica
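
Sorting the whole frame is one way to find this row. As an alternative sketch, idxmax() returns the index label of the maximum, and nlargest() returns the top rows directly. The three-row frame here is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'sepal length': [5.1, 7.9, 6.4],
    'species': ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor'],
})

# idxmax gives the index label of the largest value; loc then fetches that row.
label = df['sepal length'].idxmax()
row_via_idxmax = df.loc[[label]]

# nlargest returns the n rows with the largest values in the given column.
row_via_nlargest = df.nlargest(1, 'sepal length')

print(label)
print(row_via_nlargest)
```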

Exercise

Print the full information of the flower whose petal length is the second shortest among the 50 Iris-setosa flowers.



In [25]:

    
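# keep only the Iris-setosa rows, sort them by petal length, then take the second row
setosa = df[df["species"] == "Iris-setosa"]
setosa.sort_values("petal length", ascending=True).iloc[1]



    Out[25]:



sepal length            4.3
sepal width             3.0
petal length            1.1
petal width             0.1
species         Iris-setosa
Name: 13, dtype: object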
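
A detail worth keeping in mind for this exercise: sorting returns a new DataFrame and leaves the original untouched, so the sorted result has to be captured (or chained) before indexing into it. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'petal length': [1.4, 1.0, 1.1]})

# sort_values returns a sorted copy; df itself keeps its original order.
sorted_df = df.sort_values('petal length')

second_in_original_order = df.iloc[1]['petal length']     # row at position 1 of df
second_shortest = sorted_df.iloc[1]['petal length']       # row at position 1 after sorting

print(list(df.index), list(sorted_df.index))
print(second_in_original_order, second_shortest)
```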

Pandas also has some basic plotting functions.



In [39]:

    
import matplotlib.pyplot as plt
%matplotlib inline
df.hist()









    Out[39]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10f89b810>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10fbc3410>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x10fc4a050>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x10fcacbd0>]], dtype=object)

Part 2: Visualization using Seaborn

Seaborn may not be installed by default in Anaconda.

If the import below fails, install it with pip: pip install seaborn.



In [1]:

    
import seaborn as sns

# make the plots show right below the code
%matplotlib inline

Plotting univariate distributions

The distplot() function draws a histogram and fits a kernel density estimate (KDE). In seaborn 0.11 and later, distplot() is deprecated in favor of histplot() and displot().



In [2]:

    
import numpy as np
x = np.random.normal(size=100)
sns.distplot(x)









    Out[2]:





<matplotlib.axes._subplots.AxesSubplot at 0x10976df90>
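
To show what the two layers of this plot represent, here is a NumPy-only sketch that computes a density-normalized histogram and a Gaussian kernel density estimate by hand. The bandwidth rule and the grid are assumptions made for illustration; seaborn's own defaults may differ.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
x = rng.normal(size=100)

# Histogram with density=True: bar heights are scaled so the bar areas sum to 1.
heights, edges = np.histogram(x, bins=10, density=True)
hist_area = float((heights * np.diff(edges)).sum())

# Gaussian KDE: place a normal bump of width `bandwidth` on every sample and
# average the bumps. The bandwidth is a common rule of thumb (an assumption).
bandwidth = 1.06 * x.std(ddof=1) * len(x) ** (-0.2)
grid = np.linspace(x.min() - 4 * bandwidth, x.max() + 4 * bandwidth, 400)
z = (grid[:, None] - x[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * bandwidth * np.sqrt(2 * np.pi))

# Trapezoid-rule area under the KDE curve; it should be close to 1.
step = grid[1] - grid[0]
kde_area = float(((density[:-1] + density[1:]) / 2).sum() * step)

print(hist_area, kde_area)
```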



In [10]:

    
import random
x = [random.normalvariate(0, 1) for i in range(1000)]
sns.distplot(x)









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x10a669c10>

Plotting bivariate distributions



In [16]:

    
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
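
As a sanity check on what multivariate_normal produces, this sketch draws a larger sample with the same mean and covariance and compares the sample statistics with the requested ones. The seed and the sample size of 5000 are assumptions chosen so the estimates are stable.

```python
import numpy as np

rng = np.random.RandomState(0)
mean, cov = [0, 1], [(1, .5), (.5, 1)]

# Each row of `data` is one (x, y) draw.
data = rng.multivariate_normal(mean, cov, 5000)

sample_mean = data.mean(axis=0)
sample_cov = np.cov(data, rowvar=False)  # rowvar=False: columns are the variables

print(data.shape)
print(sample_mean)
print(sample_cov)
```

The off-diagonal covariance of 0.5 is what gives the point cloud in the plots below its upward tilt.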

Scatter plot



In [17]:

    
sns.jointplot(x="x", y="y", data=df)









    Out[17]:





<seaborn.axisgrid.JointGrid at 0x10a849a50>

Hexbin plot



In [18]:

    
sns.jointplot(x="x", y="y", data=df, kind="hex")









    Out[18]:





<seaborn.axisgrid.JointGrid at 0x10b50c710>

Kernel density estimation



In [19]:

    
sns.jointplot(x="x", y="y", data=df, kind="kde")









    Out[19]:





<seaborn.axisgrid.JointGrid at 0x10b75c8d0>

Visualizing pairwise relationships in a dataset

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:



In [21]:

    
iris = sns.load_dataset("iris")
sns.pairplot(iris)









    Out[21]:





<seaborn.axisgrid.PairGrid at 0x10b18de50>



In [22]:

    
# we can add colors to different species
sns.pairplot(iris, hue="species")









    Out[22]:





<seaborn.axisgrid.PairGrid at 0x110a5cfd0>

Visualizing linear relationships



In [23]:

    
tips = sns.load_dataset("tips")
tips.head()









    Out[23]:






  
    
      
   total_bill   tip     sex  smoker  day    time  size
0       16.99  1.01  Female      No  Sun  Dinner     2
1       10.34  1.66    Male      No  Sun  Dinner     3
2       21.01  3.50    Male      No  Sun  Dinner     3
3       23.68  3.31    Male      No  Sun  Dinner     2
4       24.59  3.61  Female      No  Sun  Dinner     4

We can use the regplot function to show the linear relationship between total_bill and tip. The shaded band around the line is the 95% confidence interval for the regression estimate.



In [24]:

    
sns.regplot(x="total_bill", y="tip", data=tips)









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x111e37d10>

Visualizing higher order relationships



In [25]:

    
anscombe = sns.load_dataset("anscombe")
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"])









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x112817a90>

The plot clearly shows that this is not a good model. Let's try to fit a polynomial regression model with degree 2.



In [164]:

    
sns.regplot(x="x", y="y", data=anscombe[anscombe["dataset"] == "II"], order=2)









    Out[164]:





<matplotlib.axes._subplots.AxesSubplot at 0x11b683190>
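
The improvement from the degree-2 model can also be checked numerically. The sketch below fits both models with np.polyfit and compares their residual sums of squares. The values are dataset II of Anscombe's quartet typed in by hand, on the assumption that they match the copy returned by sns.load_dataset("anscombe"); the helper residual_sum_of_squares is made up for this example.

```python
import numpy as np

# Dataset II of Anscombe's quartet, entered by hand (assumed to match seaborn's copy).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def residual_sum_of_squares(degree):
    """Least-squares polynomial fit of the given degree; returns the squared error."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    return float(((y - fitted) ** 2).sum())

rss_linear = residual_sum_of_squares(1)
rss_quadratic = residual_sum_of_squares(2)

print(rss_linear, rss_quadratic)
```

The quadratic residual is far smaller than the linear one, which matches what the two plots show.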

Strip plots

A strip plot is similar to a scatter plot, but is used when one of the variables is categorical.



In [168]:

    
sns.stripplot(x="day", y="total_bill", data=tips)









    Out[168]:





<matplotlib.axes._subplots.AxesSubplot at 0x11c47b3d0>

Boxplots



In [169]:

    
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)









    Out[169]:





<matplotlib.axes._subplots.AxesSubplot at 0x11b62d690>
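
Each box in a box plot spans the first to the third quartile of its group, with a line at the median. The same numbers can be read off with groupby and describe. The small frame tips_like below is made up and only imitates the shape of the tips data.

```python
import pandas as pd

tips_like = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Thur', 'Thur', 'Sun', 'Sun', 'Sun', 'Sun'],
    'total_bill': [10.0, 12.0, 14.0, 16.0, 20.0, 24.0, 28.0, 32.0],
})

# One row per day; the 25%, 50% and 75% columns are the edges and middle of each box.
summary = tips_like.groupby('day')['total_bill'].describe()

print(summary[['25%', '50%', '75%']])
```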

Bar plots



In [170]:

    
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)









    Out[170]:





<matplotlib.axes._subplots.AxesSubplot at 0x11c2d5f90>
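
By default, the height of each bar is the mean of the y variable within its group, so for a 0/1 column such as survived it is the survival rate. The equivalent table can be computed with groupby. The frame passengers below is a made-up stand-in for the titanic data.

```python
import pandas as pd

passengers = pd.DataFrame({
    'sex': ['male', 'male', 'female', 'female', 'female', 'male'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'Third'],
    'survived': [1, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 survived column per (sex, class) group = survival rate.
rates = passengers.groupby(['sex', 'class'])['survived'].mean()

print(rates)
```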
