In the last notebook we explored our data and, to simplify subsequent tasks, created a trimmed subset of it. Once we have a clean data set, the next step in the data exploration process is to gain a better understanding of that data set. The easiest way to do this is to explore the data visually. In this Notebook, we use several visualization techniques from the Seaborn library to learn more about the flights data.
First we need to read the data into this Notebook. To simplify the process, we read the data from our HDF file. We quickly check the data over and then generate a trimmed data set (by keeping every 5000th row) before we begin to make visualizations.
In [1]:
import pandas as pd
data = pd.read_hdf('sflights.h5', 'table')
In [2]:
data.describe()
Out[2]:
In [3]:
# Let's generate a trimmed data set to speed up exploration.
# The copy lets us add columns later without a SettingWithCopyWarning.
tdata = data[::5000].copy()
print(len(tdata))
print(tdata.head())
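To make the effect of the step slice concrete, the following minimal sketch applies the same `[::step]` syntax to a plain Python list that stands in for the DataFrame rows. The row count of 23,000 is a made-up number chosen for illustration, not the size of the flights data.

```python
import math

# Hypothetical stand-in for the DataFrame rows: just the row numbers.
n_rows = 23000  # assumed size, for illustration only
step = 5000
rows = list(range(n_rows))

# Same slicing syntax as data[::5000]: keep rows 0, 5000, 10000, ...
subset = rows[::step]

print(subset)
# The subset always holds ceil(n_rows / step) rows.
print(len(subset) == math.ceil(n_rows / step))
```

The subset keeps the first row and then every 5000th row after it, so its size shrinks by roughly a factor of 5000.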
We can check which version of Seaborn is installed by using the built-in
help function, which displays basic information about the Seaborn module.
While you can do this in your own Notebook, my container has Seaborn
version 0.5.1 installed. This is expected since, according to the class
Dockerfile, Seaborn was installed by using pip3, which retrieves the
latest version of the module from the Python Package Index.
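As an alternative to reading through the help output, the installed version can be looked up programmatically. The sketch below uses the standard library `importlib.metadata` module (Python 3.8 or later); the helper name `installed_version` is hypothetical and introduced only for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when Seaborn is not installed.
print(installed_version("seaborn"))
```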
The next step is to make a pair plot that allows an easy visual
comparison of the different dimensions in our data set. We do this in
Seaborn by using the PairGrid object to make scatter plots between
each dimension.
In [4]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
pg = sns.PairGrid(tdata)
pg.map(plt.scatter)
We can select specific columns to use to make pair plots by using
Seaborn. For example, in the previous set of plots, the month and Day
columns do not visually provide significant information. We can remove
those and plot the data of interest. In this case, we also explicitly
set the axis limits of our plot to help highlight visual trends.
In [5]:
pg = sns.PairGrid(tdata[['dTime', 'aDelay', 'dDelay', 'distance']])
pg.map_diag(plt.hist)
pg.map_offdiag(plt.scatter)
# Let's explicitly set the axes limits
axes = pg.axes
xlim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
ylim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
for i in range(len(xlim)):
    for j in range(len(ylim)):
        axes[i, j].set_xlim(xlim[j])
        axes[i, j].set_ylim(ylim[i])
The previous plots showed the aggregate data, but we can also differentiate the data by day of the week. To do this, we add a new column that contains the name of the day as a string (as opposed to the numerical value). We can add this column to our small data set by using a lambda function that uses the numerical day as an index into our list of strings. Note that we should verify this linear mapping: if an assumption like this is incorrect, it can cause significant problems later.
In [6]:
# First we create a simple list of the week day names
dow = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
# Now we add a column by converting the int value to an index into our list.
tdata['DoW'] = tdata['Day'].apply(lambda x: dow[int(x - 1)])
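One way to verify the mapping is to compute the expected day code for a date whose weekday is known and compare it with the `Day` value stored for that date. The sketch below assumes the coding 1 = Sunday through 7 = Saturday implied by the list above; the helper `expected_day_code` is hypothetical, and whether the flights data carries a full calendar date to compare against is also an assumption.

```python
from datetime import date

# Same list as in the cell above: code 1 is assumed to mean Sunday.
dow = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']


def expected_day_code(d):
    """Day code for a calendar date under the assumed 1=Sunday ... 7=Saturday scheme."""
    # isoweekday() returns 1=Monday ... 7=Sunday, so shift it by one.
    return d.isoweekday() % 7 + 1


# Spot check against a date whose weekday is known: 11 September 2001 was a Tuesday.
known = date(2001, 9, 11)
code = expected_day_code(known)
print(code, dow[code - 1])
```

If the `Day` value recorded for flights on that date differs from the computed code, the list of names needs to be reordered before it is used for labels.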
In [7]:
# We can use the shorthand pairplot function
pp = sns.pairplot(tdata[['dTime', 'aDelay', 'dDelay', 'distance', 'DoW']],
                  hue='DoW', palette="Blues_d", diag_kind="kde")
axes = pp.axes
xlim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
ylim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
for i in range(len(xlim)):
    for j in range(len(ylim)):
        axes[i, j].set_xlim(xlim[j])
        axes[i, j].set_ylim(ylim[i])
In this case, the additional colored day information didn't produce any new insights. Sometimes this is due to the small size of the plots, so we can try a larger plot that might gain clarity by distinguishing data from different days of the week. In the following plot, we focus on the relationship between departure and arrival delays, colored by day of the week. While no obvious differences are noted here, the basic idea is an important one to learn.
In [8]:
# Let's specify a color palette by using a runtime context.
with sns.color_palette("Blues_d", 7):
    # Make a scatter plot with no regression fit
    lviz = sns.lmplot('dDelay', 'aDelay', tdata, hue='Day', size=8, fit_reg=False)
    # Change the axes limits
    ltmp = lviz.set(xlim=(-30, 70), ylim=(-45, 70))
# Make plot visually less cluttered
sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)
In some cases, we simply want to compare aggregate distributions, for example, the typical departure delay as a function of day. Here we can use a box plot to summarize the distribution of the data for each day. The box plot displays a box whose lower and upper edges mark the first and third quartiles, so that the box encloses the middle half of the data. A box plot also has a line through the box that indicates the median value, and whiskers that extend from the box to indicate the typical extent of the majority of the data. Points that are considered outliers are displayed beyond the whiskers. The plot below demonstrates the variation in departure delay as a function of the day of the week.
In [9]:
sns.boxplot(tdata.dDelay, tdata.Day)
Out[9]:
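The quantities that a box plot draws can also be computed directly, which helps when reading the figure. The following minimal sketch uses made-up delay values and assumes the common convention that whiskers reach to the most extreme points within 1.5 times the interquartile range of the box; the rule applied by a particular Seaborn version may differ.

```python
import statistics

# Made-up departure delays in minutes, for illustration only.
delays = [-5, -2, 0, 1, 3, 4, 6, 8, 12, 60]

# Quartiles: the box spans q1 to q3 and the line inside marks the median.
q1, median, q3 = statistics.quantiles(delays, n=4, method="inclusive")
iqr = q3 - q1

# Assumed whisker rule: reach to the most extreme points within 1.5 * IQR of the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = [d for d in delays if low_fence <= d <= high_fence]
whiskers = (min(inside), max(inside))

# Anything beyond the fences is drawn as an individual outlier point.
outliers = [d for d in delays if d < low_fence or d > high_fence]

print(q1, median, q3)
print(whiskers, outliers)
```

Here the single 60 minute delay falls outside the upper fence, so it would be drawn as a separate point rather than stretching the whisker.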
Another useful visualization tool is the violin plot, which is similar to the box plot, but rather than a box that indicates the quartiles and the median, the violin plot changes shape based on the actual distribution of the data. A squatter shape indicates a more compact data distribution, while a longer shape indicates a more varied distribution. The violin plot also displays the full range of the data, as shown below (where we see that Thursdays have long departure delays on average).
In [10]:
sns.violinplot(tdata.dDelay, tdata.Day)
Out[10]:
Histograms are a great way to visually summarize information. But rather
than always overplotting similar histograms, we can use the Seaborn
FacetGrid object to compare histograms. We demonstrate this below, by
first comparing the distances of different flights for different days,
and second by comparing the departure delays for different days.
In [11]:
# A single overplotted histogram makes things hard to see, so we use one row per day.
viz = sns.FacetGrid(tdata[tdata.distance < 3000], row="DoW", hue="DoW", palette="Blues_d",
                    size=1.7, aspect=4) #, hue_order=days, row_order=days)
viz.map(plt.hist, "distance", bins=30)
Out[11]:
In [12]:
# Compare the departure delay histograms, one row per day.
viz = sns.FacetGrid(tdata[(tdata.dDelay < 30) & (tdata.dDelay > 0)], row="DoW", hue="DoW", palette="Blues_d",
                    size=1.7, aspect=4)
viz.map(plt.hist, "dDelay", bins=10)
Out[12]:
Sometimes small changes can improve the visual clarity of a result. Here we first select only data for the Atlanta airport. We also increase the number of bins and the data range, which highlights the differences and similarities between departure delays for different days of the week. We could also use these distributions to model the data, using regression to quantify the differences statistically, for example by fitting a Gamma distribution.
In [13]:
# Select Atlanta departures in the first two months; use more bins and a wider delay range.
viz = sns.FacetGrid(data[(data.dDelay < 90) & (data.aDelay > 0) & data.depart.str.contains('ATL') &
                         (data.month < 3)], row="Day", hue="Day", palette="Blues_d",
                    size=1.7, aspect=4)
viz.map(plt.hist, "dDelay", bins=30, normed=False)
Out[13]:
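As a rough illustration of what modeling these histograms with a Gamma distribution could involve, the sketch below estimates the shape and scale parameters by the method of moments, that is, by matching the sample mean and variance. This is a simpler technique than a regression fit, the delay values are made up, and the helper name `gamma_moments_fit` is hypothetical. The Gamma distribution is defined only for positive values, so non-positive delays would need to be excluded first.

```python
import statistics


def gamma_moments_fit(values):
    """Estimate Gamma shape and scale by matching the sample mean and variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    # For a Gamma distribution, mean = shape * scale and variance = shape * scale**2.
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


# Made-up positive delays in minutes, for illustration only.
delays = [2, 4, 4, 4, 5, 5, 7, 9]
shape, scale = gamma_moments_fit(delays)
print(shape, scale)
```

Fitting each day separately and comparing the resulting parameters would be one way to put numbers on the visual differences between the rows.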
For large data sets, it is often more illustrative to see the density of
points. We demonstrate this below by using the Seaborn jointplot
method. This creates a binned version of the data; in the example below,
the binning is hexagonal. The jointplot also displays
the marginal distributions along each dimension for further clarity.
In [14]:
jp = sns.jointplot('dDelay', 'aDelay', tdata[tdata.aDelay > 5], kind="hex",
                   stat_func=None, color="#8855AA", xlim=(5, 50), ylim=(5, 50))
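The idea behind a binned density plot can be sketched in a few lines: each point is assigned to a cell, and the cell counts are what the color scale encodes. The example below uses square cells instead of hexagons to keep the arithmetic simple, so it is a simplification of what the hexagonal binning does; the point values and the helper name `bin_counts_2d` are made up for illustration.

```python
from collections import Counter


def bin_counts_2d(points, bin_width):
    """Count how many (x, y) points fall in each square bin of the given width."""
    counts = Counter()
    for x, y in points:
        # Integer bin indices; floor division also handles negative values.
        counts[(int(x // bin_width), int(y // bin_width))] += 1
    return counts


# Made-up (departure delay, arrival delay) pairs in minutes.
points = [(6, 7), (8, 9), (12, 14), (13, 11), (14, 18), (31, 29)]
counts = bin_counts_2d(points, bin_width=10)

# The most populated bin is the one that would be drawn darkest.
print(counts.most_common(1))
```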
Earlier we tried to distinguish the data for different days of the week
on the same plot. In some cases, it is easier to simply plot the data in
different figures that are side by side. We can do this in Seaborn by
using a FacetGrid and indicating which DataFrame column should be used
to identify the correct subplot for the current row. We can also wrap
the figures to limit the width. Below we plot the departure versus
arrival delays for each day separately, wrapping the subplots to
three columns. We also color each subplot differently.
In [15]:
# Plot each day in its own subplot, wrapped to three columns.
viz = sns.FacetGrid(data[::100], col="Day", col_wrap=3, hue="Day", palette="Blues_d", size=4)
viz.map(sns.regplot, 'aDelay', 'dDelay')
viz.set(xlim = (-20, 50), ylim=(-20, 50))
Out[15]:
We also can compare aggregate statistics, for example, mean departure
delay, as a function of two attributes. To create this visualization, we
first need to create a pivot table from our DataFrame. To do this, we
group our data by month and Day, compute the mean on the grouped
data, and display the result as shown in the next code cell.
In [16]:
# Let's make a Pivot Table
# First we group the data by Month and Day
df = tdata.groupby(['month', 'Day'])
dd = df.mean()
dd.head()
Out[16]:
This new DataFrame has a multi-index, where two columns have been used to index rows, as opposed to the standard single column index. Before we can make a pivot table, we first need to reset the index. We next change the departure delay to an integer value, which will display more nicely than a floating point value in our heatmap. We finally pivot the data to make the pivot table before displaying the result as a heatmap.
In [17]:
# Now we reset the index (to make a regular DataFrame)
# and convert the dDelay column to integer
dd.reset_index(inplace=True)
dd['dDelay'] = dd['dDelay'].astype(int)
# Now we pivot the DataFrame to make a Matrix with values encoded.
dp = dd.pivot(index='month', columns='Day', values='dDelay')
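To see what the group, mean, and pivot steps compute, it can help to write them out in plain Python on a handful of rows. The sketch below is not how pandas implements these operations; it only reproduces the result on made-up `(month, Day, dDelay)` values so that each step is visible.

```python
from collections import defaultdict
from statistics import fmean

# Made-up rows of (month, Day, dDelay), for illustration only.
rows = [(1, 1, 10.0), (1, 1, 13.0), (1, 2, 4.5), (2, 1, 7.0), (2, 2, 1.0), (2, 2, 2.0)]

# Step 1: group the delays by the (month, Day) pair, like groupby(['month', 'Day']).
groups = defaultdict(list)
for month, day, delay in rows:
    groups[(month, day)].append(delay)

# Step 2: reduce each group to its mean, truncated to an integer like astype(int).
means = {key: int(fmean(vals)) for key, vals in groups.items()}

# Step 3: pivot so that months index the rows and days index the columns.
months = sorted({m for m, _ in means})
days = sorted({d for _, d in means})
table = [[means.get((m, d)) for d in days] for m in months]

print(days)
for m, row in zip(months, table):
    print(m, row)
```

Each entry of `table` corresponds to one cell of the heatmap: the row selects the month, the column selects the day, and the value is the truncated mean delay.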
In [18]:
# Now we can plot the heatmap with values in place
f = plt.figure(figsize=(8, 10))
sns.heatmap(dp, annot=True, fmt='d')
#sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)
We can also use Seaborn to make a scatter plot of data that is colored by a third attribute, in this case day of the week. We can also have Seaborn fit a linear regression to the data to indicate potential differences. We do this below to compare departure and arrival delays for Thursday and Friday.
In [19]:
# Make a scatter plot that is colored by day.
# Also fit a linear regression to each day.
lviz = sns.lmplot('dDelay', 'aDelay',
                  tdata[(tdata.aDelay > 1) & ((tdata.Day == 5) | (tdata.Day == 6))],
                  hue='Day', size=8, ci=68)
ltmp = lviz.set(xlim=(-30, 70), ylim=(-30, 70))
sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)
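The regression lines in a plot like this are ordinary least-squares fits, one per hue group. The following sketch computes such a fit by hand for two groups of made-up values, which is a convenient way to obtain the slopes and intercepts as numbers; it does not reproduce Seaborn's confidence bands, and the helper name `least_squares_line` is hypothetical.

```python
from statistics import fmean


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up (dDelay, aDelay) values for two day codes, for illustration only.
by_day = {
    5: ([0, 10, 20, 30], [-5, 5, 15, 25]),
    6: ([0, 10, 20, 30], [0, 5, 10, 15]),
}

# Fit one line per day so that the slopes can be compared directly.
fits = {day: least_squares_line(xs, ys) for day, (xs, ys) in by_day.items()}
print(fits)
```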
One issue to be wary of is extrapolating too much from a limited data
set. We can demonstrate this by making a heatmap that compares the mean
departure delay as a function of day of the week and month for the
entire flights data. We do this below, by first grouping the full data
set by month and Day, calculating the mean, and pivoting the
DataFrame. We use this new pivot table to make the Seaborn heatmap,
which does look different than the similar heatmap derived from a
limited amount of data.
In [20]:
# First we group the full data set by month and Day
dff = data.groupby(['month', 'Day'])
# Can change the statistical function to one of min, max, sum, mean, std
# 9/11/01 was a Tuesday.
ddf = dff.mean()
# Now we reset the index (to make a regular DataFrame)
# and convert the dDelay column to integer
ddf.reset_index(inplace=True)
ddf['dDelay'] = ddf['dDelay'].astype(int)
# Now we pivot the DataFrame to make a Matrix with values encoded.
dpf = ddf.pivot(index='month', columns='Day', values='dDelay')
In [21]:
# Now we can plot the heatmap with values in place
f = plt.figure(figsize=(8, 10))
sns.heatmap(dpf, annot=True, fmt='d')
#sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)
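The reason the two heatmaps can disagree is that each cell of the trimmed version rests on very few flights. The sketch below simulates this for a single (month, Day) cell: the delays are random draws from an arbitrarily chosen exponential distribution, not real flight data, and the cell size of 50,000 rows is an assumption made for illustration.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so that the sketch is repeatable

# Simulated delays for one (month, Day) cell; the distribution is an arbitrary choice.
full = [random.expovariate(1 / 10) for _ in range(50000)]

# Mean of the full cell versus the mean of an every-5000th-row subset.
subset = full[::5000]
print(len(full), round(fmean(full), 2))
print(len(subset), round(fmean(subset), 2))
```

With only ten values in the subset, a single long delay can shift the cell mean noticeably, whereas the mean over the full cell is much more stable.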