Data Exploration

Professor Robert J. Brunner


Introduction

In the last notebook we explored our data and, to simplify subsequent tasks, created a trimmed subset of it. Once we have a clean data set, the next step in the data exploration process is to gain a better understanding of that data set. The easiest way to do this is to explore the data visually. In this Notebook, we use several visualization techniques from the Seaborn library to learn more about the flights data.

First we need to read the data into this Notebook. To simplify the process, we read the data from our HDF file. We quickly check the data over and then generate a trimmed data set (by taking every 5000th row) before we begin to make visualizations.



In [1]:

    
import pandas as pd
data = pd.read_hdf('sflights.h5', 'table')



In [2]:

    
data.describe()









    Out[2]:

                month             Day           dTime          aDelay          dDelay        distance
count  5723673.000000  5723673.000000  5723673.000000  5723673.000000  5723673.000000  5723673.000000
mean         6.291581        3.949829     1348.688044        5.528249        8.115272      735.173682
std          3.381755        1.997942      482.638757       31.429291       28.234083      574.815182
min          1.000000        1.000000        1.000000    -1116.000000     -204.000000       21.000000
25%          3.000000        2.000000      930.000000       -9.000000       -3.000000      314.000000
50%          6.000000        4.000000     1333.000000       -2.000000        0.000000      575.000000
75%          9.000000        6.000000     1740.000000       10.000000        6.000000      983.000000
max         12.000000        7.000000     2400.000000     1688.000000     1692.000000     4962.000000

8 rows × 6 columns



In [3]:

    
# Let's generate a trimmed data set to speed up exploration.
# We take a copy so that columns added later do not modify a view of data.

tdata = data[::5000].copy()
print(len(tdata))
print(tdata.head())









    



1145
       month  Day  dTime  aDelay  dDelay depart arrive  distance
0          1    3   1806      -3      -4    BWI    CLT       361
5232       1    3    706       2      -4    BUF    PIT       186
10406      1    4   1930      -2      -5    GSP    LGA       610
15577      1    5   2035      55      50    PHL    BWI        90
20790      1    1    624     -17      -6    BOS    CLT       728

[5 rows x 8 columns]
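As a quick sanity check on the trimmed size, the arithmetic behind the stride slice can be sketched as follows. The row count is the one reported by data.describe() above; the variable names are illustrative only.

```python
import math

n_rows = 5723673   # row count reported by data.describe()
step = 5000        # stride used in data[::5000]

# A stride slice keeps rows 0, step, 2*step, ... so it yields ceil(n / step) rows.
kept = len(range(0, n_rows, step))
print(kept, math.ceil(n_rows / step))  # both values are 1145
```

This agrees with the length of 1145 printed by the cell above.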

We can check which version of Seaborn is installed by using the built-in help function to display basic information about the Seaborn module. You can do this in your own Notebook; my container has Seaborn version 0.5.1 installed. This is expected, since according to the class Dockerfile, Seaborn was installed by using pip3, which retrieves the latest version of the module from the Python Package Index.
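As a minimal sketch of an alternative to reading the help output, the snippet below looks up a package version through the Python standard library. The helper name installed_version is an illustrative assumption; it is not part of Seaborn or of the original Notebook.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string, or None when Seaborn is not installed
# in the current environment.
print(installed_version("seaborn"))
```

Knowing the version matters because the cells in this Notebook follow the 0.5.1 interface, and later Seaborn releases changed some of these calls (for instance, the size argument of the grid functions was renamed height in Seaborn 0.9).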

The next step is to make a pair plot that allows an easy visual comparison of the different dimensions in our data set. We do this in Seaborn by using the PairGrid object to make scatter plots between each pair of dimensions.



In [4]:

    
import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns

sns.set(style="white")

pg = sns.PairGrid(tdata)

pg.map(plt.scatter)

Pair Plotting

We can select specific columns to use when making pair plots in Seaborn. For example, in the previous set of plots, the month and Day columns do not visually provide significant information. We can remove those and plot only the data of interest. In this case, we also explicitly set the axis limits of our plots to help highlight visual trends.



In [5]:

    
pg = sns.PairGrid(tdata[['dTime', 'aDelay', 'dDelay', 'distance']])

pg.map_diag(plt.hist)
pg.map_offdiag(plt.scatter)

# Let's explicitly set the axes limits
axes = pg.axes

xlim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
ylim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]

for i in range(len(xlim)):
    for j in range(len(ylim)):
        axes[i, j].set_xlim(xlim[j])
        axes[i, j].set_ylim(ylim[i])

The previous plots showed the aggregate data, but we can also differentiate the data by day of the week. To do this, we simply add a new column that contains the name of the day as a string (as opposed to the numerical value). We can add this column to our small data set by using a lambda function that uses the numerical day as an index into our list of strings. Note that we should verify this linear mapping; an assumption like this, if incorrect, can cause significant problems later.



In [6]:

    
# First we create a simple list of the week day names

dow = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Now we add a column by converting from the int value to an index into our list.

tdata['DoW'] = tdata['Day'].apply(lambda x: dow[int(x - 1)])
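The mapping from Day to a day name is an assumption, and one way to test it is to compare a record with a known calendar date against Python's own calendar. The sketch below assumes that a day-of-month value is available from the raw source (the trimmed data shown here has no such column), and the day_code value is hypothetical. The names SUNDAY_FIRST, MONDAY_FIRST, and mapping_matches are illustrative.

```python
import datetime

# Mapping used in this Notebook: Day == 1 is taken to be Sunday.
SUNDAY_FIRST = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                'Thursday', 'Friday', 'Saturday']

# Alternative convention to test against (an assumption): Day == 1 is Monday.
MONDAY_FIRST = SUNDAY_FIRST[1:] + SUNDAY_FIRST[:1]


def mapping_matches(names, year, month, day_of_month, day_code):
    """Return True if names[day_code - 1] is the real weekday of the date."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    actual = MONDAY_FIRST[datetime.date(year, month, day_of_month).weekday()]
    return names[day_code - 1] == actual


# 11 September 2001 was a Tuesday. Whatever code the raw record for that
# date carries, at most one of the two conventions can agree with it.
day_code = 3  # hypothetical value; read it from the raw data in practice
print(mapping_matches(SUNDAY_FIRST, 2001, 9, 11, day_code))
print(mapping_matches(MONDAY_FIRST, 2001, 9, 11, day_code))
```

If the check fails for the list used in the cell above, the list order should be changed before any day names are interpreted.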



In [7]:

    
# We can use the shorthand pairplot function

pp = sns.pairplot(tdata[['dTime', 'aDelay', 'dDelay', 'distance', 'DoW']], 
                  hue='DoW', palette="Blues_d", diag_kind="kde")

axes = pp.axes

xlim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
ylim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]

for i in range(len(xlim)):
    for j in range(len(ylim)):
        axes[i, j].set_xlim(xlim[j])
        axes[i, j].set_ylim(ylim[i])

In this case, the additional day-of-week coloring didn't produce any new insights. Sometimes this is due to the small size of the plots, so we can try a larger plot that might gain clarity in distinguishing data from different days of the week. In the following plot, we focus on the relationship between departure and arrival delays, colored by day of the week. While no obvious differences are seen here, the basic idea is an important one to learn.



In [8]:

    
# Let's specify a color palette by using a runtime context.
with sns.color_palette("Blues_d", 7):
    
    # Make a scatter plot with no regression fit
    lviz = sns.lmplot('dDelay', 'aDelay', tdata, hue='Day', size = 8, fit_reg=False)
    
    # Change the axes limits

    ltmp = lviz.set(xlim=(-30, 70), ylim=(-45, 70))

# Make plot visually less cluttered
sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)

Box Plots

In some cases, we simply want to compare aggregate distributions, for example, the typical departure delay as a function of day. Here we can use a box plot to summarize the distribution for each day. The box plot displays a box whose upper and lower edges enclose the inner two quartiles. A box plot also has a line through the box that indicates the median value, and whiskers that extend from the box to indicate the typical extent of the majority of the data. Points that are considered outliers are displayed beyond the whiskers. The plot below demonstrates the variation in departure delay as a function of the day of the week.



In [9]:

    
sns.boxplot(tdata.dDelay, tdata.Day)









    Out[9]:





<matplotlib.axes.AxesSubplot at 0x7f9111c35748>
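To make the elements of a box plot concrete, the sketch below computes them by hand for one group of values. It assumes the common rule in which whiskers reach the most extreme points within 1.5 times the interquartile range of the box; the function name box_stats and the sample delays are made up for illustration.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws for one group of values."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_limit, high_limit = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers end at the most extreme data points inside the limits.
    inside = values[(values >= low_limit) & (values <= high_limit)]
    return {
        'q1': q1, 'median': median, 'q3': q3,
        'whisker_low': inside.min(), 'whisker_high': inside.max(),
        'outliers': values[(values < low_limit) | (values > high_limit)],
    }


# Made-up departure delays in minutes, with one very late flight.
stats = box_stats([-5, -3, 0, 2, 6, 8, 60])
print(stats['median'], stats['whisker_high'], stats['outliers'])
```

The single very late flight falls outside the upper limit, so it would be drawn as an individual point rather than stretching the whisker.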

Violin Plot

Another useful visualization tool is the violin plot, which is similar to the box plot, but rather than a box that indicates the quartiles and the median, the violin plot changes shape based on the actual distribution of the data. A squatter shape indicates a more compact data distribution, while a longer shape indicates a more varied distribution. The violin plot also displays the full range of the data, as shown below (where we see that Thursdays have long departure delays on average).



In [10]:

    
sns.violinplot(tdata.dDelay, tdata.Day)









    Out[10]:





<matplotlib.axes.AxesSubplot at 0x7f9111c035f8>

Histograms

Histograms are a great way to visually summarize information. But rather than always overplotting similar histograms, we can use the Seaborn FacetGrid object to compare histograms. We demonstrate this below, by first comparing the distances of different flights for different days, and second by comparing the departure delays for different days.



In [11]:

    
# Overplotting makes things hard to see, so we draw one histogram per day of the week.

viz = sns.FacetGrid(tdata[tdata.distance < 3000], row="DoW", hue="DoW", palette="Blues_d",
                  size=1.7, aspect=4) #, hue_order=days, row_order=days)
viz.map(plt.hist, "distance", bins=30)









    Out[11]:





<seaborn.axisgrid.FacetGrid at 0x7f9111dda940>



In [12]:

    
# The same approach, now for departure delays between 0 and 30 minutes.

viz = sns.FacetGrid(tdata[(tdata.dDelay < 30) & (tdata.dDelay > 0)], row="DoW", hue="DoW", palette="Blues_d",
                  size=1.7, aspect=4)
viz.map(plt.hist, "dDelay", bins=10)









    Out[12]:





<seaborn.axisgrid.FacetGrid at 0x7f9111daf668>

Sometimes small changes can improve the visual clarity of a result. Here we first select only data for the Atlanta airport. We also increase the number of bins and the data range, which highlights the differences and similarities between departure delays for different days of the week. We could also use these distributions to model the data, using regression (for example, with a Gamma distribution) to quantify these differences statistically.



In [13]:

    
# Select Atlanta departures in January and February (dDelay < 90, aDelay > 0),
# and use more bins over a wider range.

viz = sns.FacetGrid(data[(data.dDelay < 90) & (data.aDelay > 0) & data.depart.str.contains('ATL') & 
                         (data.month < 3)], row="Day", hue="Day", palette="Blues_d",
                  size=1.7, aspect=4)
viz.map(plt.hist, "dDelay", bins=30, normed=False)









    Out[13]:





<seaborn.axisgrid.FacetGrid at 0x7f9111d6af28>
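The text mentions the Gamma distribution without showing a fit. As one simple possibility (a method-of-moments estimate, which is a swapped-in technique and not the regression approach referred to above), the shape and scale can be estimated from the sample mean and variance of the positive delays. The function name gamma_moments and the synthetic sample are assumptions made for illustration.

```python
import numpy as np


def gamma_moments(delays):
    """Method-of-moments estimates (shape, scale) for positive delays."""
    delays = np.asarray(delays, dtype=float)
    if np.any(delays <= 0):
        raise ValueError("the Gamma distribution only supports positive values")
    mean, var = delays.mean(), delays.var(ddof=1)
    # For a Gamma distribution: mean = shape * scale, variance = shape * scale**2.
    return mean ** 2 / var, var / mean


# Synthetic positive delays (minutes), drawn only to exercise the function.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=10.0, size=5000)
shape, scale = gamma_moments(sample)
print(shape, scale)
```

Applied to each day's delays in turn, the fitted shape and scale would give two numbers per day that can be compared directly.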

Density Plots

For large data sets, it is often more illustrative to see the density of points. We demonstrate this below by using the Seaborn jointplot function. This creates a binned version of the data; in the example below, hexagonal binning is used. The jointplot also displays the marginal distributions along each dimension for further clarity.



In [14]:

    
jp = sns.jointplot('dDelay', 'aDelay', tdata[tdata.aDelay > 5], kind="hex", 
                   stat_func=None, color="#8855AA", xlim=(5, 50), ylim=(5, 50))

Earlier we tried to distinguish the data for different days of the week on the same plot. In some cases, it is easier to simply plot the data in separate subplots placed side by side. We can do this in Seaborn by using a FacetGrid and indicating which DataFrame column determines the subplot in which each row is drawn. We can also wrap the subplots to limit the width. Below we plot the departure versus arrival delays for each day separately, wrapping the subplots into three columns. We also color each subplot differently.



In [15]:

    
# One regression panel per day, using every 100th row of the full data.

viz = sns.FacetGrid(data[::100], col="Day", col_wrap=3, hue="Day", palette="Blues_d", size=4)
viz.map(sns.regplot, 'aDelay', 'dDelay')
viz.set(xlim = (-20, 50), ylim=(-20, 50))









    Out[15]:





<seaborn.axisgrid.FacetGrid at 0x7f910e9480f0>

Heatmaps

We also can compare aggregate statistics, for example, mean departure delay, as a function of two attributes. To create this visualization, we first need to create a pivot table from our DataFrame. To do this, we group our data by month and Day, compute the mean on the grouped data, and display the result as shown in the next code cell.



In [16]:

    
# Let's make a Pivot Table

# First we group the data by Month and Day
df = tdata.groupby(['month', 'Day'])
dd = df.mean()
dd.head()









    Out[16]:

                 dTime    aDelay     dDelay    distance
month Day
1     1    1323.277778 -1.888889   1.666667  629.555556
      2    1217.666667  6.066667   8.933333  594.533333
      3    1304.888889 -6.166667  -1.277778  772.277778
      4    1565.571429  7.071429   9.571429  553.428571
      5    1538.600000  7.133333  10.000000  821.400000


5 rows × 4 columns

This new DataFrame has a multi-index, where two columns have been used to index rows, as opposed to the standard single column index. Before we can make a pivot table, we first need to reset the index. We next change the departure delay to an integer value, which will display more nicely than a floating point value in our heatmap. We finally pivot the data to make the pivot table before displaying the result as a heatmap.



In [17]:

    
# Now we reset the index (to make a regular DataFrame)
# and convert the dDelay column to integer

dd.reset_index(inplace=True)  
dd['dDelay'] = dd['dDelay'].astype(int)

# Now we pivot the DataFrame to make a Matrix with values encoded.
dp = dd.pivot('month', 'Day', 'dDelay')
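The group, reset, and pivot steps can be followed on a table small enough to check by hand. The sketch below uses a made-up stand-in for the flights data and passes the pivot arguments by keyword; the names flights, flat, and table are illustrative. It also shows that astype(int) truncates toward zero, so rounding first may be preferable when the displayed integer should be the nearest one.

```python
import pandas as pd

# Tiny made-up stand-in for the flights table (values are illustrative only).
flights = pd.DataFrame({
    'month':  [1, 1, 1, 1, 1, 1, 2, 2],
    'Day':    [1, 1, 1, 2, 2, 2, 1, 2],
    'dDelay': [4, 7, 9, -2, -3, -3, 10, 3],
})

# Group by both keys; the result is indexed by a (month, Day) MultiIndex.
means = flights.groupby(['month', 'Day'])['dDelay'].mean()

# Turn the index levels back into ordinary columns.
flat = means.reset_index()

# astype(int) truncates toward zero; round() first to get the nearest integer.
flat['truncated'] = flat['dDelay'].astype(int)
flat['rounded'] = flat['dDelay'].round().astype(int)

# One row per month, one column per Day, mean delay as the cell value.
table = flat.pivot(index='month', columns='Day', values='rounded')
print(table)
```

The resulting matrix has the same layout as the pivot table built in the cell above, which is the form a heatmap function expects.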



In [18]:

    
# Now we can plot the heatmap with values in place

f = plt.figure(figsize=(8, 10))

sns.heatmap(dp, annot=True, fmt='d')


#sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)

We can also use Seaborn to make a scatter plot of data that is colored by a third attribute, in this case day of the week. We also can have Seaborn calculate a linear regression to the data to indicate potential differences. We do this below to compare departure and arrival delays for Thursday and Friday.



In [19]:

    
# Make a scatter plot that is colored by day.
# Also fit a linear regression to each day.

lviz = sns.lmplot('dDelay', 'aDelay', 
                  tdata[(tdata.aDelay > 1) & ((tdata.Day == 5) | (tdata.Day == 6))], 
                  hue='Day', size = 8, ci=68)

ltmp = lviz.set(xlim=(-30, 70), ylim=(-30, 70))

sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)

One issue to be wary of is extrapolating too much from a limited data set. We can demonstrate this by making a heatmap that compares the mean departure delay as a function of day of the week and month for the entire flights data. We do this below, by first grouping the full data set by month and Day, calculating the mean, and pivoting the DataFrame. We use this new pivot table to make the Seaborn heatmap, which does look different from the similar heatmap derived from a limited amount of data.



In [20]:

    
# Group the full data set by month and Day; below we reset the index
# (to make a regular DataFrame) and convert the dDelay column to integer

dff = data.groupby(['month', 'Day'])

# Can change the statistical function to one of min, max, sum, mean, std
# 9/11/01 was a Tuesday.

ddf = dff.mean()

ddf.reset_index(inplace=True)  
ddf['dDelay'] = ddf['dDelay'].astype(int)

# Now we pivot the DataFrame to make a Matrix with values encoded.
dpf = ddf.pivot('month', 'Day', 'dDelay')



In [21]:

    
# Now we can plot the heatmap with values in place

f = plt.figure(figsize=(8, 10))

sns.heatmap(dpf, annot=True, fmt='d')


#sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)
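A rough calculation suggests why the heatmap from the trimmed data is noisy. The sketch below uses numbers printed earlier in this Notebook and assumes, as simplifications, that the trimmed rows are spread evenly over the month-by-Day cells, that they behave like independent draws, and that the overall standard deviation of dDelay applies within each cell.

```python
import math

n_rows = 1145          # rows in the trimmed data set (printed earlier)
n_cells = 12 * 7       # month-by-Day cells in the heatmap
sigma = 28.234083      # std of dDelay from data.describe(), in minutes

rows_per_cell = n_rows / n_cells
# Standard error of a mean of n independent values: sigma / sqrt(n).
std_error = sigma / math.sqrt(rows_per_cell)
print(round(rows_per_cell, 1), round(std_error, 1))
```

With only about fourteen rows per cell, each mean in the trimmed heatmap is uncertain by several minutes, which is comparable in size to the mean delays themselves.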
