In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os

Downloaded a few comma-delimited data from the United States Department of Transportation, Bureau of Transportation Statistics website.
This data ranges from the months of Janurary - May in 2016.

In this specific example, I collected data that had either DEST_CITY_NAME or ORGIN_CITY_NAME related to California. What does this mean? It means that a specific flight had either departed or landed in a California based airport.

You can use filters to identify a specific dataset you want to look at.
You can download your own source here



In [118]:

    
# Assign a list of available files in my data directory (../data/2016/) to a variable
files = os.listdir("data/2016");

# Display
files









    Out[118]:





['162820111_T_ONTIME-APRIL.csv',
 '162820111_T_ONTIME-FEB.csv',
 '162820111_T_ONTIME-JAN.csv',
 '162820111_T_ONTIME-MARCH.csv',
 '162820111_T_ONTIME-MAY.csv']



In [119]:

    
# Read through all files and concat all df into a single dataframe, df.
framelist = []

for file in files:
    tempdf = pd.read_csv("data/2016/" + file)
    framelist.append(tempdf)
    
df = pd.concat(framelist);









    



//anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2705: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Let's look at some basic info with the dataset.



In [4]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 485856 entries, 0 to 103365
Data columns (total 22 columns):
YEAR                     485856 non-null int64
MONTH                    485856 non-null int64
UNIQUE_CARRIER           485856 non-null object
AIRLINE_ID               485856 non-null int64
ORIGIN_AIRPORT_ID        485856 non-null int64
ORIGIN_AIRPORT_SEQ_ID    485856 non-null int64
ORIGIN_CITY_MARKET_ID    485856 non-null int64
ORIGIN_CITY_NAME         485856 non-null object
DEST_AIRPORT_ID          485856 non-null int64
DEST_AIRPORT_SEQ_ID      485856 non-null int64
DEST_CITY_MARKET_ID      485856 non-null int64
DEST_CITY_NAME           485856 non-null object
ARR_DELAY_NEW            479829 non-null float64
CANCELLED                485856 non-null float64
CANCELLATION_CODE        5125 non-null object
AIR_TIME                 479829 non-null float64
CARRIER_DELAY            85180 non-null float64
WEATHER_DELAY            85180 non-null float64
NAS_DELAY                85180 non-null float64
SECURITY_DELAY           85180 non-null float64
LATE_AIRCRAFT_DELAY      85180 non-null float64
Unnamed: 21              0 non-null float64
dtypes: float64(9), int64(9), object(4)
memory usage: 85.3+ MB



In [5]:

    
df.head()









    Out[5]:






  
    
      
      YEAR
      MONTH
      UNIQUE_CARRIER
      AIRLINE_ID
      ORIGIN_AIRPORT_ID
      ORIGIN_AIRPORT_SEQ_ID
      ORIGIN_CITY_MARKET_ID
      ORIGIN_CITY_NAME
      DEST_AIRPORT_ID
      DEST_AIRPORT_SEQ_ID
      ...
      ARR_DELAY_NEW
      CANCELLED
      CANCELLATION_CODE
      AIR_TIME
      CARRIER_DELAY
      WEATHER_DELAY
      NAS_DELAY
      SECURITY_DELAY
      LATE_AIRCRAFT_DELAY
      Unnamed: 21
    
  
  
    
      0
      2016
      4
      AA
      19805
      12478
      1247803
      31703
      New York, NY
      12892
      1289203
      ...
      0.0
      0.0
      NaN
      329.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1
      2016
      4
      AA
      19805
      12478
      1247803
      31703
      New York, NY
      12892
      1289203
      ...
      0.0
      0.0
      NaN
      329.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      2
      2016
      4
      AA
      19805
      12478
      1247803
      31703
      New York, NY
      12892
      1289203
      ...
      68.0
      0.0
      NaN
      339.0
      33.0
      0.0
      7.0
      0.0
      28.0
      NaN
    
    
      3
      2016
      4
      AA
      19805
      12478
      1247803
      31703
      New York, NY
      12892
      1289203
      ...
      0.0
      0.0
      NaN
      345.0
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      4
      2016
      4
      AA
      19805
      12478
      1247803
      31703
      New York, NY
      12892
      1289203
      ...
      22.0
      0.0
      NaN
      350.0
      0.0
      0.0
      22.0
      0.0
      0.0
      NaN
    
  

5 rows × 22 columns



In [6]:

    
# Drop the last column in place, 'Unnamed: 21', which is the -1 index
df.drop(df.columns[[-1]], axis=1, inplace=True)

Lets see what different airlines we have data for.



In [7]:

    
df.UNIQUE_CARRIER.unique()









    Out[7]:





array(['AA', 'AS', 'B6', 'DL', 'F9', 'HA', 'NK', 'OO', 'UA', 'VX', 'WN'], dtype=object)

I retrieved this information from a lookup table from the transtats website, let's assign it to a key/value dict for later use.



In [8]:

    
airlinekeys = {'AA': 'American Airlines Inc',
               'AS': 'Alaska Airlines Inc.',
               'B6': 'JetBlue Airways',
               'DL': 'Delta Airlines Inc.',
               'F9': 'Frontier Airlines Inc',
               'HA': 'Hawaiian Airlines Inc',
               'NK': 'Spirit Airlines',
               'OO': 'SkyWest Airlines Inc',
               'UA': 'United Airlines Inc.',
               'VX': 'Virgin America',
               'WN': 'Southwest Airlines Co' }

pd.DataFrame.from_dict(airlinekeys, orient="index")









    Out[8]:






  
    
      
      0
    
  
  
    
      VX
      Virgin America
    
    
      DL
      Delta Airlines Inc.
    
    
      NK
      Spirit Airlines
    
    
      F9
      Frontier Airlines Inc
    
    
      AA
      American Airlines Inc
    
    
      B6
      JetBlue Airways
    
    
      HA
      Hawaiian Airlines Inc
    
    
      UA
      United Airlines Inc.
    
    
      OO
      SkyWest Airlines Inc
    
    
      WN
      Southwest Airlines Co
    
    
      AS
      Alaska Airlines Inc.

Amount of flights recorded



In [120]:

    
# Display frequency of each airline recorded.
df['UNIQUE_CARRIER'].value_counts()









    Out[120]:





WN    138315
OO     72990
UA     70824
AA     67302
DL     47548
AS     29289
VX     24433
B6     17043
NK      9736
F9      4574
HA      3802
Name: UNIQUE_CARRIER, dtype: int64

So, as you can see here, 'WN' (Southwest) services the most flights IN/OUT of California. 'HA' (Hawaiian Airlines) has the fewest.

Lets visualize that.



In [122]:

    
fig = plt.figure(figsize=(20,10))

df.UNIQUE_CARRIER.value_counts().plot(kind='barh')

plt.ylabel('Frequency', fontsize=15); plt.xlabel('Airline', fontsize=15)
plt.tick_params(axis='both', labelsize=15)
plt.title('Amount of flights recorded per airline (2016)', fontsize=15)









    Out[122]:





<matplotlib.text.Text at 0x118856b00>

Delay Statistics



In [125]:

    
# Function to return a dictionary of all delays greater than 0 per carrier

def delayratio(carrier):
    
    carrier = carrier.upper()
    
    total = (df.ix[(df['UNIQUE_CARRIER'] == carrier)]).shape[0]
    delays = (df.ix[(df['UNIQUE_CARRIER'] == carrier ) & (df['ARR_DELAY_NEW'] > 0 )]).shape[0]

    return({ 'Airline': carrier, 'Total': total, 'Delays': delays })



In [128]:

    
carrierlist = list(df.UNIQUE_CARRIER.unique())

print(carrierlist)









    



['AA', 'AS', 'B6', 'DL', 'F9', 'HA', 'NK', 'OO', 'UA', 'VX', 'WN']



In [130]:

    
dflist = []

for val, item in enumerate(carrierlist):
    dflist.append(delayratio(carrierlist[val]))

dflist









    Out[130]:





[{'Airline': 'AA', 'Delays': 23258, 'Total': 67302},
 {'Airline': 'AS', 'Delays': 9412, 'Total': 29289},
 {'Airline': 'B6', 'Delays': 6750, 'Total': 17043},
 {'Airline': 'DL', 'Delays': 14263, 'Total': 47548},
 {'Airline': 'F9', 'Delays': 1389, 'Total': 4574},
 {'Airline': 'HA', 'Delays': 1276, 'Total': 3802},
 {'Airline': 'NK', 'Delays': 4701, 'Total': 9736},
 {'Airline': 'OO', 'Delays': 25704, 'Total': 72990},
 {'Airline': 'UA', 'Delays': 19553, 'Total': 70824},
 {'Airline': 'VX', 'Delays': 9989, 'Total': 24433},
 {'Airline': 'WN', 'Delays': 56518, 'Total': 138315}]



In [131]:

    
# Let's put the list of dicts into a dataframe
delayratiodf = pd.DataFrame(dflist)

# Let's set the index of the dataframe to be 'Airline'
delayratiodf.set_index('Airline', inplace=True)

# Let's use the airlinekey dictionary we made earlier to replace the airline codenames
delayratiodf.rename(index=airlinekeys, inplace=True)



In [132]:

    
delayratiodf









    Out[132]:






  
    
      
      Delays
      Total
    
    
      Airline
      
      
    
  
  
    
      American Airlines Inc
      23258
      67302
    
    
      Alaska Airlines Inc.
      9412
      29289
    
    
      JetBlue Airways
      6750
      17043
    
    
      Delta Airlines Inc.
      14263
      47548
    
    
      Frontier Airlines Inc
      1389
      4574
    
    
      Hawaiian Airlines Inc
      1276
      3802
    
    
      Spirit Airlines
      4701
      9736
    
    
      SkyWest Airlines Inc
      25704
      72990
    
    
      United Airlines Inc.
      19553
      70824
    
    
      Virgin America
      9989
      24433
    
    
      Southwest Airlines Co
      56518
      138315



In [145]:

    
# Lets sort by total flights
delayratiodf.sort_values('Total', inplace=True)

# Create a stacked barchart
plot = delayratiodf[['Delays', 'Total']].plot(kind='barh', figsize=(20,15), legend=True, fontsize=20, color=['r', 'b'])

# Increase the legend size
plot.legend(loc=4, prop={'size':20})









    Out[145]:





<matplotlib.legend.Legend at 0x115d52b70>

Frequency of delays



In [146]:

    
# Simple function to determine ratio
def ratio(df):
    
    return float("%.2f" % ((df[0]/df[1])*100))



In [147]:

    
# Create a new column 'Percentage', and apply the function 'ratio' to the data.
delayratiodf['Percentage'] = delayratiodf.apply(ratio, axis=1)

# Sort values again by percentage
delayratiodf.sort_values('Percentage', inplace=True)



In [148]:

    
delayratiodf









    Out[148]:






  
    
      
      Delays
      Total
      Percentage
    
    
      Airline
      
      
      
    
  
  
    
      United Airlines Inc.
      19553
      70824
      27.61
    
    
      Delta Airlines Inc.
      14263
      47548
      30.00
    
    
      Frontier Airlines Inc
      1389
      4574
      30.37
    
    
      Alaska Airlines Inc.
      9412
      29289
      32.13
    
    
      Hawaiian Airlines Inc
      1276
      3802
      33.56
    
    
      American Airlines Inc
      23258
      67302
      34.56
    
    
      SkyWest Airlines Inc
      25704
      72990
      35.22
    
    
      JetBlue Airways
      6750
      17043
      39.61
    
    
      Southwest Airlines Co
      56518
      138315
      40.86
    
    
      Virgin America
      9989
      24433
      40.88
    
    
      Spirit Airlines
      4701
      9736
      48.28

According to this, Spirit Airlines is the worst performing in terms of delay frequency at 48% of the time, and United Airlines has the lowest chance, 27.61%

Let's visualize that



In [149]:

    
ax = delayratiodf.plot(y='Percentage', kind='barh', figsize=(20,10), title='Percntage of delayed flights', color='red')

plt.ylabel('Percentage', fontsize=15); plt.xlabel('Airline', fontsize=20)
plt.tick_params(axis='both', labelsize=20)

for p in ax.patches:
    ax.annotate("%.2f" % p.get_width(), (p.get_x() + p.get_width(), p.get_y()), xytext=(5, 10), textcoords='offset points')

To be continued...



In [ ]:

	YEAR	MONTH	UNIQUE_CARRIER	AIRLINE_ID	ORIGIN_AIRPORT_ID	ORIGIN_AIRPORT_SEQ_ID	ORIGIN_CITY_MARKET_ID	ORIGIN_CITY_NAME	DEST_AIRPORT_ID	DEST_AIRPORT_SEQ_ID	...	ARR_DELAY_NEW	CANCELLATION_CODE	AIR_TIME	CARRIER_DELAY	WEATHER_DELAY	NAS_DELAY	SECURITY_DELAY	LATE_AIRCRAFT_DELAY	Unnamed: 21
0	2016	4	AA	19805	12478	1247803	31703	New York, NY	12892	1289203	...	0.0	NaN	329.0	NaN	NaN	NaN	NaN	NaN	NaN
1	2016	4	AA	19805	12478	1247803	31703	New York, NY	12892	1289203	...	0.0	NaN	329.0	NaN	NaN	NaN	NaN	NaN	NaN
2	2016	4	AA	19805	12478	1247803	31703	New York, NY	12892	1289203	...	68.0	NaN	339.0	33.0	0.0	7.0	0.0	28.0	NaN
3	2016	4	AA	19805	12478	1247803	31703	New York, NY	12892	1289203	...	0.0	NaN	345.0	NaN	NaN	NaN	NaN	NaN	NaN
4	2016	4	AA	19805	12478	1247803	31703	New York, NY	12892	1289203	...	22.0	NaN	350.0	0.0	0.0	22.0	0.0	0.0	NaN

	0
VX	Virgin America
DL	Delta Airlines Inc.
NK	Spirit Airlines
F9	Frontier Airlines Inc
AA	American Airlines Inc
B6	JetBlue Airways
HA	Hawaiian Airlines Inc
UA	United Airlines Inc.
OO	SkyWest Airlines Inc
WN	Southwest Airlines Co
AS	Alaska Airlines Inc.

	Delays	Total
Airline
American Airlines Inc	23258	67302
Alaska Airlines Inc.	9412	29289
JetBlue Airways	6750	17043
Delta Airlines Inc.	14263	47548
Frontier Airlines Inc	1389	4574
Hawaiian Airlines Inc	1276	3802
Spirit Airlines	4701	9736
SkyWest Airlines Inc	25704	72990
United Airlines Inc.	19553	70824
Virgin America	9989	24433
Southwest Airlines Co	56518	138315