In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
Downloaded a few comma-delimited data from the United States Department of Transportation, Bureau of Transportation Statistics website.
This data ranges from the months of Janurary - May in 2016.
In this specific example, I collected data that had either DEST_CITY_NAME or ORGIN_CITY_NAME related to California. What does this mean? It means that a specific flight had either departed or landed in a California based airport.
You can use filters to identify a specific dataset you want to look at.
You can download your own source here
In [118]:
# Assign a list of available files in my data directory (../data/2016/) to a variable
files = os.listdir("data/2016");
# Display
files
Out[118]:
In [119]:
# Read through all files and concat all df into a single dataframe, df.
framelist = []
for file in files:
tempdf = pd.read_csv("data/2016/" + file)
framelist.append(tempdf)
df = pd.concat(framelist);
Let's look at some basic info with the dataset.
In [4]:
df.info()
In [5]:
df.head()
Out[5]:
In [6]:
# Drop the last column in place, 'Unnamed: 21', which is the -1 index
df.drop(df.columns[[-1]], axis=1, inplace=True)
Lets see what different airlines we have data for.
In [7]:
df.UNIQUE_CARRIER.unique()
Out[7]:
I retrieved this information from a lookup table from the transtats website, let's assign it to a key/value dict for later use.
In [8]:
airlinekeys = {'AA': 'American Airlines Inc',
'AS': 'Alaska Airlines Inc.',
'B6': 'JetBlue Airways',
'DL': 'Delta Airlines Inc.',
'F9': 'Frontier Airlines Inc',
'HA': 'Hawaiian Airlines Inc',
'NK': 'Spirit Airlines',
'OO': 'SkyWest Airlines Inc',
'UA': 'United Airlines Inc.',
'VX': 'Virgin America',
'WN': 'Southwest Airlines Co' }
pd.DataFrame.from_dict(airlinekeys, orient="index")
Out[8]:
In [120]:
# Display frequency of each airline recorded.
df['UNIQUE_CARRIER'].value_counts()
Out[120]:
So, as you can see here, 'WN' (Southwest) services the most flights IN/OUT of California. 'HA' (Hawaiian Airlines) has the fewest.
Lets visualize that.
In [122]:
fig = plt.figure(figsize=(20,10))
df.UNIQUE_CARRIER.value_counts().plot(kind='barh')
plt.ylabel('Frequency', fontsize=15); plt.xlabel('Airline', fontsize=15)
plt.tick_params(axis='both', labelsize=15)
plt.title('Amount of flights recorded per airline (2016)', fontsize=15)
Out[122]:
In [125]:
# Function to return a dictionary of all delays greater than 0 per carrier
def delayratio(carrier):
carrier = carrier.upper()
total = (df.ix[(df['UNIQUE_CARRIER'] == carrier)]).shape[0]
delays = (df.ix[(df['UNIQUE_CARRIER'] == carrier ) & (df['ARR_DELAY_NEW'] > 0 )]).shape[0]
return({ 'Airline': carrier, 'Total': total, 'Delays': delays })
In [128]:
carrierlist = list(df.UNIQUE_CARRIER.unique())
print(carrierlist)
In [130]:
dflist = []
for val, item in enumerate(carrierlist):
dflist.append(delayratio(carrierlist[val]))
dflist
Out[130]:
In [131]:
# Let's put the list of dicts into a dataframe
delayratiodf = pd.DataFrame(dflist)
# Let's set the index of the dataframe to be 'Airline'
delayratiodf.set_index('Airline', inplace=True)
# Let's use the airlinekey dictionary we made earlier to replace the airline codenames
delayratiodf.rename(index=airlinekeys, inplace=True)
In [132]:
delayratiodf
Out[132]:
In [145]:
# Lets sort by total flights
delayratiodf.sort_values('Total', inplace=True)
# Create a stacked barchart
plot = delayratiodf[['Delays', 'Total']].plot(kind='barh', figsize=(20,15), legend=True, fontsize=20, color=['r', 'b'])
# Increase the legend size
plot.legend(loc=4, prop={'size':20})
Out[145]:
In [146]:
# Simple function to determine ratio
def ratio(df):
return float("%.2f" % ((df[0]/df[1])*100))
In [147]:
# Create a new column 'Percentage', and apply the function 'ratio' to the data.
delayratiodf['Percentage'] = delayratiodf.apply(ratio, axis=1)
# Sort values again by percentage
delayratiodf.sort_values('Percentage', inplace=True)
In [148]:
delayratiodf
Out[148]:
According to this, Spirit Airlines is the worst performing in terms of delay frequency at 48% of the time, and United Airlines has the lowest chance, 27.61%
Let's visualize that
In [149]:
ax = delayratiodf.plot(y='Percentage', kind='barh', figsize=(20,10), title='Percntage of delayed flights', color='red')
plt.ylabel('Percentage', fontsize=15); plt.xlabel('Airline', fontsize=20)
plt.tick_params(axis='both', labelsize=20)
for p in ax.patches:
ax.annotate("%.2f" % p.get_width(), (p.get_x() + p.get_width(), p.get_y()), xytext=(5, 10), textcoords='offset points')
In [ ]: