In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
from scipy import stats
The data set we will use for this exercise comes from the featured Kaggle datasets, containing data about cold calls to both existing and potential car insurance customers. The data provides some demographic data about each person called (customer), such as age, job, marital status and education, along with call details and some features indicating insurance status.
The goal here is to perform exploratory data analysis (EDA), using both visual EDA and some statistics and probability.
For statistics we will use the Chi-square test of independence, which is meant to be used with categorical data. In this case we will be concerned with the effects of some demographic features on the outcome of the calls.
Exponential distributions have the probability density function
$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0\\ 0 & \text{if } x < 0 \end{cases}$$and cumulative distribution function
$$F(x; \lambda) = 1 - e^{-\lambda x}.$$The expectation value (mean) is $\lambda^{-1}$, and percentiles can be computed using
$$F^{-1}(p; \lambda) = \frac{-\ln(1-p)}{\lambda},$$where $p$ is the desired percentile. Since we can easily compute the mean of our data, we can use it to obtain $\lambda$ and thus compute percentiles.
In our case, we can easily compute the mean of our data using pandas:
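As a quick check (a minimal sketch: the mean value below is illustrative, not the actual call_duration mean computed later in the notebook), the closed-form percentile formula agrees with scipy.stats.expon:

import numpy as np
from scipy import stats

mean_duration = 350.0                    # illustrative mean, not taken from the data
lam = 1 / mean_duration                  # lambda = 1 / mean
for p in (0.5, 0.95):
    closed_form = -np.log(1 - p) / lam   # F^{-1}(p; lambda)
    via_scipy = stats.expon.ppf(p, scale=1 / lam)
    print(p, closed_form, via_scipy)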
Chi-square tests are used to compare two categorical variables from a single population; they are used to determine if there is a significant association between them. The test has three requirements:
1. the variables under study are categorical,
2. the data come from a (simple) random sample of the population,
3. the expected frequency in each cell of the contingency table is at least 5.
For this set we will assume the second requirement has been met.
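As a minimal sketch of how the expected-count requirement can be checked (the counts here are hypothetical, not taken from this data set), scipy.stats.chi2_contingency also returns the table of expected frequencies:

import numpy as np
from scipy import stats

toy = np.array([[20, 15],
                [30, 35]])                       # hypothetical contingency table
chi2, p_value, dof, expected = stats.chi2_contingency(toy)
print(chi2, p_value, dof)
print(expected)                                  # every expected count should be at least 5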
In [28]:
insurance = pd.read_csv('data/car_insurance.csv')
# very useful from the solution: add parse_dates=list_of_dates to have some columns already in datetime format!
In [4]:
insurance.head()
Out[4]:
In [5]:
insurance.shape
Out[5]:
In [6]:
insurance.info()
In [7]:
insurance.describe(include='all').T
Out[7]:
In [8]:
# print(min(pd.to_datetime(insurance.CallStart)), max(pd.to_datetime(insurance.CallStart)))
# print(min(pd.to_datetime(insurance.CallEnd)), max(pd.to_datetime(insurance.CallEnd)))
# call duration in seconds
insurance['call_duration'] = (pd.to_datetime(insurance.CallEnd) - pd.to_datetime(insurance.CallStart)).dt.total_seconds()
In [9]:
insurance.call_duration.head()
Out[9]:
In [10]:
insurance[['CarInsurance', 'call_duration']].corr()
Out[10]:
The variables don't seem to be highly correlated.
In [12]:
fig = plt.figure(figsize=(16, 10))
ax = sns.boxplot(x='CarInsurance', y='call_duration', data=insurance);
In the solutions a t-test is performed, but it isn't very useful so I'm not reproducing it.
In [14]:
fig = plt.figure(figsize=(16, 8))
ax = sns.distplot(insurance.call_duration, bins=25, kde=False);
In [15]:
insurance.call_duration.mean()
Out[15]:
In [16]:
# Estimate lambda from the sample mean, then use the inverse CDF for the percentiles
lam = insurance.call_duration.mean()**-1
mean_est = lam**-1
median_est = -np.log(1-0.5)/lam
perc95_est = -np.log(1-0.95)/lam
print('Estimated mean, median and 95th percentile: {:.5f}, {:.5f}, {:.5f}'.format(
    mean_est,
    median_est,
    perc95_est))
In [17]:
print('Mean, median and 95th percentile: {:.5f}, {:.5f}, {:.5f}'.format(
    insurance.call_duration.mean(),
    insurance.call_duration.median(),
    insurance.call_duration.quantile(0.95)))
In [18]:
print('Has car insurance mean, median and 95th percentile: {:.5f}, {:.5f}, {:.5f}'.format(
    insurance[insurance.CarInsurance == 1].call_duration.mean(),
    insurance[insurance.CarInsurance == 1].call_duration.median(),
    insurance[insurance.CarInsurance == 1].call_duration.quantile(0.95)))
print('Doesn\'t have car insurance mean, median and 95th percentile: {:.5f}, {:.5f}, {:.5f}'.format(
    insurance[insurance.CarInsurance == 0].call_duration.mean(),
    insurance[insurance.CarInsurance == 0].call_duration.median(),
    insurance[insurance.CarInsurance == 0].call_duration.quantile(0.95)))
In [19]:
for col in insurance.columns:
    if set(insurance[col].unique()) == set([0, 1]):
        print(col)
In [20]:
from scipy.special import binom as binom
p = insurance.CarInsurance.sum()/insurance.CarInsurance.count()
print(binom(10, 3) * p**3 * (1-p)**7)
print(stats.binom.pmf(3, n=10, p=p))
In [21]:
print(stats.binom.cdf(3, n=10, p=p))
In [29]:
def age_cat(df):
    if df.Age >= 65:
        return '65+'
    elif df.Age >= 55:
        return '55-64'
    elif df.Age >= 45:
        return '45-54'
    elif df.Age >= 35:
        return '35-44'
    elif df.Age >= 25:
        return '25-34'
    elif df.Age >= 18:
        return '18-24'
    else:
        return 'Something strange'
In [30]:
insurance['age_categories'] = insurance.apply(age_cat, axis=1)
In [59]:
# fig = plt.figure(figsize=(16, 6))
# ax = plt.axes()
# ax = sns.barplot(x='index', y='age_categories', stacked=True, data=insurance.age_categories.value_counts().reset_index())
# ax.set_xlabel('Age Categories')
# ax.set_ylabel('Count');
ax = pd.DataFrame(insurance.age_categories.value_counts()).T.plot.barh(figsize=(16, 6), stacked=True)
In [37]:
fig = plt.figure(figsize=(16, 6))
ax = plt.axes()
ax = sns.distplot(insurance.Age);
The distribution seems roughly normal, but it is a little skewed to the right and shows a couple of additional peaks; it could be closer to a bimodal distribution.
In [61]:
print(insurance.Age.describe())
print(insurance.Age.skew())
print(insurance.Age.kurtosis())
In [55]:
edu_contingency = pd.pivot_table(insurance, values='Id', index='Education', columns='Outcome', aggfunc='count', margins=True)
ins_contingency = pd.pivot_table(insurance, values='Id', index='CarInsurance', columns='Outcome', aggfunc='count', margins=True)
job_contingency = pd.pivot_table(insurance, values='Id', index='Job', columns='Outcome', aggfunc='count', margins=True)
age_contingency = pd.pivot_table(insurance, values='Id', index='age_categories', columns='Outcome', aggfunc='count', margins=True)
In [59]:
print('Education vs. Outcome:\n\n{}\n\n'.format(edu_contingency))
print('Car insurance vs. Outcome:\n\n{}\n\n'.format(ins_contingency))
print('Job vs. Outcome:\n\n{}\n\n'.format(job_contingency))
print('Age category vs. Outcome:\n\n{}\n\n'.format(age_contingency))
In [67]:
def contingency_heatmap(df):
    # normalize by the grand total and drop the margins before plotting
    data = df.drop('All').drop('All', axis=1) / df.loc['All', 'All']
    fig = plt.figure(figsize=(10, 6))
    ax = sns.heatmap(data)
In [68]:
contingency_heatmap(edu_contingency)
contingency_heatmap(ins_contingency)
contingency_heatmap(job_contingency)
contingency_heatmap(age_contingency)
In [70]:
stats.chi2_contingency(edu_contingency.drop('All').drop('All', axis=1))
Out[70]:
In [72]:
stats.chi2_contingency(ins_contingency.drop('All').drop('All', axis=1))
Out[72]:
In [73]:
stats.chi2_contingency(job_contingency.drop('All').drop('All', axis=1))
Out[73]:
In [74]:
stats.chi2_contingency(age_contingency.drop('All').drop('All', axis=1))
Out[74]:
Answer: the job table because it has counts under 5 in some cells.
In [77]:
stats.chi2_contingency(edu_contingency.drop('All').drop(['All', 'other'], axis=1))
Out[77]:
In [78]:
stats.chi2_contingency(ins_contingency.drop('All').drop(['All', 'other'], axis=1))
Out[78]:
In [79]:
stats.chi2_contingency(job_contingency.drop('All').drop(['All', 'other'], axis=1))
Out[79]:
In [80]:
stats.chi2_contingency(age_contingency.drop('All').drop(['All', 'other'], axis=1))
Out[80]:
The job table still isn't good:
In [81]:
job_contingency.drop('All').drop(['All', 'other'], axis=1)
Out[81]:
Thus, from the tests we can say that the Outcome variable is not independent of Education, CarInsurance and age_categories.
This data set comes from the featured Kaggle datasets, containing three tables that relate to commercial airline flights. The flight delay and cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.
There are three tables: flights, airlines and airports. The airports table contains:
IATA_CODE: Location Identifier (String)
AIRPORT: Airport's Name (String)
CITY: City Name of the Airport (String)
STATE: State Name of the Airport (String)
COUNTRY: Country Name of the Airport (String)
LATITUDE: Latitude of the Airport (Numeric)
LONGITUDE: Longitude of the Airport (Numeric)
We cleaned this data in the cleaning notebook; recall that we started with a sample of the full set available on Kaggle in order to reduce processing times.
The Poisson distribution represents the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known average rate and independently of the time since the last event. This comment about independence is actually referring to the memoryless property of the exponential distribution (a distribution introduced earlier). Put another way, the Poisson distribution describes a process made of a series of events whose waiting times each follow the same exponential distribution.
The Poisson probability formula:
$$P(x; \mu) = e^{-\mu}\frac{\mu^{x}}{x!}$$where $x$ is the number of successes and $\mu$ is the mean of the distribution.
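A minimal sketch (with illustrative values for $\mu$ and $x$, not derived from the flight data) comparing this formula against scipy.stats.poisson:

import math
from scipy import stats

mu, x = 3.5, 2                                   # illustrative mean and count
formula = math.exp(-mu) * mu**x / math.factorial(x)
print(formula, stats.poisson.pmf(x, mu=mu))      # the two values match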
Log transformations can be useful for a number of reasons; two of them are scaling data for easier interpretation and transforming data to fit a distribution that can be modeled (e.g. making exponential data linear).
Example (notice how x-values beyond 10 are off the charts, but ln(x) is at visible scales out to 1000 and beyond):
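Below is a minimal sketch of that example (purely illustrative, not tied to the flight data):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 1000, 500)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
ax1.plot(x, x)
ax1.set_title('x')                               # values beyond ~10 dwarf the rest
ax2.plot(x, np.log(x))
ax2.set_title('ln(x)')                           # stays at a visible scale out to 1000 and beyond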
We will use the apply() method to create a table that contains the following stats for each airline: flight volume, flight proportion, diverted rate and cancelled rate.
In [3]:
flights = pd.read_csv('data/flights_cleaned_sm.csv')
In [63]:
flights.info()
In [4]:
airlines = pd.read_csv('data/airlines_cleaned.csv')
In [65]:
flights_volume = flights.groupby('airline')[['cancelled', 'diverted']].mean().merge(
    pd.DataFrame(flights.airline.value_counts()),
    left_index=True,
    right_index=True)
flights_volume.rename(columns={'airline':'flight_volume', 'diverted':'diverted_rate', 'cancelled':'cancelled_rate'}, inplace=True)
flights_volume['flight_proportion'] = flights_volume.flight_volume.apply(lambda x: x/len(flights))
flights_volume
# from solution:
# airline_stats = flights.groupby('airline').apply(
# lambda x: Series({
# 'flight_vol': x['airline'].count(),
# 'flight_prop': x['airline'].count() / len(flights),
# 'diverted_rate': x['diverted'].mean(),
# 'cancelled_rate': x['cancelled'].mean()}))
Out[65]:
In [7]:
volume_airline_airport = flights.groupby(['airline', 'origin_airport']).year.count()\
    .reset_index()\
    .merge(airlines, left_on='airline', right_on='iata_code', how='left')\
    .drop(['iata_code', 'airline_x'], axis=1)\
    .rename(columns={'origin_airport':'airport', 'airline_y':'airline', 'year':'volume'})
In [8]:
fig = plt.figure(figsize=(16, 16))
ax = sns.barplot(y='airport', x='volume', data=volume_airline_airport.groupby('airport').sum().reset_index().sort_values(by='volume', ascending=False).head(20))
ax.set_title('Volume by airport');
In [9]:
fig = plt.figure(figsize=(16, 8))
ax = sns.barplot(x='airline', y='volume', data=volume_airline_airport.groupby('airline').sum().reset_index())
ax.set_title('Volume by airline')
plt.xticks(rotation=45);
In [10]:
fig = plt.figure(figsize=(16, 48))
#ax = plt.axes()
sns.heatmap(volume_airline_airport.pivot_table(values='volume', index='airport', columns='airline', aggfunc='sum', fill_value=0))
Out[10]:
In [11]:
print(volume_airline_airport.groupby('airline').sum()
      .reset_index()
      .sort_values(by='volume', ascending=False)
      .iloc[0, 0])
print(volume_airline_airport[volume_airline_airport.airline=='Southwest Airlines Co.'].groupby('airport').sum()
      .reset_index()
      .sort_values(by='volume', ascending=False)
      .iloc[0, 0])
I misinterpreted the expected time between cancellations as the mean of the times between two consecutive cancellations; in the solution the column is the (max - min) scheduled departure divided by the number of cancellations:
In [66]:
# def cancellations_by_airport(airline):
# temp = flights[(flights.cancelled==1) & (flights.airline==airline)]\
# .sort_values(by=['scheduled_departure', 'flight_number'])\
# .loc[:, ['scheduled_departure', 'flight_number', 'origin_airport']]\
# .set_index('flight_number')
# temp['scheduled_departure'] = pd.to_datetime(temp['scheduled_departure'])
# temp['time_between'] = (temp.scheduled_departure - temp.scheduled_departure.shift(1)).astype('timedelta64[s]') / 3600
# temp = temp.groupby('origin_airport').aggregate([np.mean, 'count'])
# temp.columns = temp.columns.droplevel()
# return pd.DataFrame(flights[flights.airline==airline].groupby('origin_airport').year.count())\
# .merge(temp, left_index=True, right_index=True)\
# .rename(columns={'year':'number_flights', 'mean':'expected_time_between', 'count':'number_cancellations'})
In [69]:
# cancellations = cancellations_by_airport('DL')#.sort_values(by='number_cancellations', ascending=False)
So let's do it again this way:
In [8]:
cancellations = flights[flights.airline=='DL'].groupby('origin_airport').apply(
    lambda x: pd.Series({
        'number_flights': x.year.count(),
        'number_cancellations': x.cancelled.sum(),
        # span of scheduled departures (in hours) divided by the number of cancellations
        'expected_time_between': ((pd.to_datetime(x.scheduled_departure.max()) -
                                   pd.to_datetime(x.scheduled_departure.min())).total_seconds()/3600) / (x.cancelled.sum() + 1e-6)
    })
)
In [9]:
cancellations
Out[9]:
In [14]:
# def cancellations_by_airport_month(airline):
# temp = flights[(flights.cancelled==1) & (flights.airline==airline)]\
# .sort_values(by=['scheduled_departure', 'flight_number'])\
# .loc[:, ['scheduled_departure', 'flight_number', 'origin_airport', 'month']]\
# .set_index('flight_number')
# temp['scheduled_departure'] = pd.to_datetime(temp['scheduled_departure'])
# temp['time_between'] = (temp.scheduled_departure - temp.scheduled_departure.shift(1)).astype('timedelta64[s]') / 3600
# temp = temp.groupby(['origin_airport', 'month']).aggregate([np.mean, 'count'])
# temp.columns = temp.columns.droplevel()
# return pd.DataFrame(flights[flights.airline==airline].groupby(['origin_airport', 'month']).year.count())\
# .merge(temp, left_index=True, right_index=True)\
# .rename(columns={'year':'number_flights', 'mean':'expected_time_between', 'count':'number_cancellations'})
In [17]:
def cancellations_by_airport_month(airline):
    cancellations_by_month = flights[flights.airline==airline].groupby(['origin_airport', 'year', 'month']).apply(
        lambda x: pd.Series({
            'number_flights': x.year.count(),
            'number_cancellations': x.cancelled.sum(),
            # span of scheduled departures (in hours) divided by the number of cancellations
            'expected_time_between': ((pd.to_datetime(x.scheduled_departure.max()) -
                                       pd.to_datetime(x.scheduled_departure.min())).total_seconds()/3600) / (x.cancelled.sum() + 1e-6)
        })
    )
    return cancellations_by_month
In [18]:
top10 = ['ATL', 'ORD', 'DFW', 'DEN', 'LAX', 'IAH', 'SFO', 'PHX', 'LAS', 'LGA']
cancellation_month = cancellations_by_airport_month('DL').loc[top10].reset_index()
In [19]:
cancellation_month
Out[19]:
In [28]:
fig = plt.figure(figsize=(16, 8))
ax = plt.axes()
ax = sns.pointplot(x='month', y='expected_time_between', hue='origin_airport', data=cancellation_month)
ax.set_title('Expected time between cancellations for top 10 airports')
ax.set_ylabel('Mean expected time (hours)')
ax.set_xlabel('Month')
ax.set_yscale('log')
ax.legend(title='Airport');
In [33]:
import math
mu = cancellations_by_airport_month('DL').loc['BOS', 2015, 5].expected_time_between
x = 5
print(stats.poisson.sf(k=x, mu=mu))
In [38]:
cancellation_counts = flights[flights.airline.isin(['WN', 'DL', 'AA'])].pivot_table(values='year', columns='cancelled', index='airline', aggfunc='count')
cancellation_counts
Out[38]:
In [39]:
stats.chi2_contingency(cancellation_counts)
Out[39]:
The expected frequencies are all above 5 and the p-value is essentially zero, so we can reject the null hypothesis that there is no difference between the cancellation rates for the top 3 airlines.
In [40]:
travel_routes = flights[(flights.airline=='DL')
                        & ((flights.origin_airport.isin(top10))
                           & (flights.destination_airport.isin(top10)))]\
    .loc[:, ['flight_number', 'origin_airport', 'destination_airport', 'diverted']]
# travel_routes['route'] = travel_routes.origin_airport + '-' + travel_routes.destination_airport
# travel_routes.groupby('route').diverted.sum() / travel_routes.groupby('route').diverted.count()
travel_routes.groupby(['origin_airport', 'destination_airport']).diverted.mean().unstack()
Out[40]:
In [43]:
plt.figure(figsize=(16, 10))
sns.heatmap(travel_routes.groupby(['origin_airport', 'destination_airport']).diverted.mean().unstack());
In [24]:
flights['ground_speed'] = flights.distance * 1.60934 / flights.air_time * 60
print(flights.sort_values(by='ground_speed').ground_speed.head())
print(flights.sort_values(by='ground_speed', ascending=False).ground_speed.head())
print(flights.iloc[203450])
print(flights.iloc[546104])
There are some infinite values due to an air time of 0 (we still have some data cleaning to do, I would say) and some very low ones. It seems that using air time and distance isn't a good way to measure ground speed, for example in the case of delays.
In [45]:
delays = flights.groupby('airline')[['air_system_delay', 'security_delay', 'airline_delay', 'late_aircraft_delay', 'weather_delay']].mean().unstack().reset_index()
delays.columns = ['reason', 'airline', 'mean_time']
fig = plt.figure(figsize=(16, 8))
ax = sns.barplot(x='airline', y='mean_time', data=delays, hue='reason');
I interpreted the delay proportions as the proportion of delays over the total number of flights:
In [26]:
delay_proportion = (flights.groupby('airline').airline_delay.count() / flights.groupby('airline').year.count()).reset_index()
delay_proportion.columns = ['airline', 'delay_proportion']
fig = plt.figure(figsize=(10, 10))
ax = sns.barplot(x='delay_proportion', y='airline', data=delay_proportion);
This is the way it is done in the solutions:
In [56]:
ax = delays.pivot_table(columns='reason', index='airline', values='mean_time').plot.barh(figsize=(16, 10), stacked=True);
In [27]:
flights_sample = flights[flights.airline.isin(['AA', 'DL', 'WN'])].sample(frac=0.1)
plt.figure(figsize=(16, 10))
pd.plotting.parallel_coordinates(flights_sample[['air_system_delay', 'security_delay', 'airline_delay', 'late_aircraft_delay', 'weather_delay', 'airline']], 'airline');
The exponential distribution has a special property that is referred to as the memoryless property. In terms of time, it states that the probability of an exponential random variable exceeding some value does not depend on how much time has already elapsed. Mathematically it looks like
$$P(X > s + t \mid X > t) = P(X > s),$$where $t$ is the amount of time already elapsed and $s$ is the additional time.
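A quick numerical check of this property (a minimal sketch with an illustrative scale parameter, not fitted to any of the data here):

from scipy import stats

scale = 600.0                                    # illustrative exponential mean
s, t = 300.0, 450.0
conditional = stats.expon.sf(s + t, scale=scale) / stats.expon.sf(t, scale=scale)
print(conditional, stats.expon.sf(s, scale=scale))   # the two probabilities are equal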
Pandas offers easy to use tools for working with time series. A couple of these tools include resampling the frequency (downsampling and upsampling), and easily plotting time series.
The resample() method lets us easily downsample and upsample a time series; it requires a DatetimeIndex, which we can create with the set_index() method.
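A minimal sketch on a toy hourly series (not the trip data) of setting a DatetimeIndex with set_index() and downsampling with resample():

import numpy as np
import pandas as pd

idx = pd.date_range('2016-01-01', periods=48, freq='H')
toy = pd.DataFrame({'timestamp': idx, 'value': np.arange(48)})
toy = toy.set_index('timestamp')                 # resample() needs a DatetimeIndex
print(toy.resample('D').mean())                  # downsample hourly values to daily means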
In [57]:
cycle = pd.read_csv('data/trip_cleaned.csv')
In [4]:
cycle.info()
In [58]:
temperatures = cycle[['starttime', 'Min_TemperatureF', 'Mean_Temperature_F', 'Max_Temperature_F']].copy()
temperatures['starttime'] = pd.to_datetime(temperatures.starttime)
temperatures.set_index('starttime', inplace=True)
In [59]:
fig = plt.figure(figsize=(16, 10))
ax = temperatures.Min_TemperatureF.plot(kind='line', style=':', color='b', label='Min')
ax = temperatures.Max_Temperature_F.plot(kind='line', style=':', color='r', label='Max')
ax = temperatures.Mean_Temperature_F.plot(kind='line', style='-', color='k', label='Mean')
ax.set_title('Temperature variations')
ax.set_xlabel('Date')
ax.set_ylabel('Temp (F)')
ax.legend();
In [60]:
temperatures_daily = temperatures.resample('D').mean()
In [61]:
temperatures_roll = temperatures_daily.rolling(7, center=True).mean().dropna()
In [62]:
fig = plt.figure(figsize=(16, 10))
ax = temperatures_roll.Min_TemperatureF.plot(kind='line', style=':', color='b', label='Min')
ax = temperatures_roll.Max_Temperature_F.plot(kind='line', style=':', color='r', label='Max')
ax = temperatures_roll.Mean_Temperature_F.plot(kind='line', style='-', color='k', label='Mean')
ax.set_title('Temperature 7-days rolling mean')
ax.set_xlabel('Date')
ax.set_ylabel('Temp (F)')
ax.legend();
In [66]:
trips = cycle[['trip_id', 'starttime', 'tripduration', 'from_station_name', 'usertype']].copy()
trips['starttime'] = pd.to_datetime(trips.starttime)
# trips.set_index('starttime', inplace=True)
In [6]:
trips_hourly = trips[trips.starttime.dt.month==7].set_index('starttime')
trips_hourly = trips_hourly.resample('H').trip_id.count().reset_index()
trips_hourly = trips_hourly.groupby(trips_hourly.starttime.dt.hour).trip_id.sum() / 31
In [7]:
fig = plt.figure(figsize=(16, 10))
ax = trips_hourly.plot(kind='line')
ax.set_title('Average number of trips in July')
ax.set_xlabel('Hour')
ax.set_ylabel('Average number of trips')
ax.xaxis.set_major_locator(plt.MaxNLocator(12));
In [8]:
trips_hour_station = trips[trips.starttime.dt.month==7].pivot_table(index='starttime', columns='from_station_name', values='trip_id', aggfunc='count')
trips_hour_station = trips_hour_station.resample('H').count()#.reset_index()
trips_hour_station = trips_hour_station.groupby(trips_hour_station.index.hour).sum() / 31
In [9]:
ax = trips_hour_station.iloc[:,0:10].plot(kind='line', figsize=(16, 10))
ax.set_title('Average number of trips per station in July')
ax.set_xlabel('Hour')
ax.set_ylabel('Average number of trips')
ax.set_xticks(range(0, 23, 2));
In [67]:
may = trips[trips.starttime.dt.month==5]
In [68]:
fig = plt.figure(figsize=(16, 10))
ax = sns.distplot(may.tripduration);
I would describe this distribution as exponential; the expected value and $P(X > 1200)$ are:
In [69]:
print(may.tripduration.mean())
print(stats.expon.sf(1200, scale=may.tripduration.mean()))
The probability for the second case is the same as the first because of the memorylessness of the exponential distribution.
The probability for the third case is $P(X > 4 \cdot 60)$:
In [70]:
print(stats.expon.sf(240, scale=may.tripduration.mean()))
In [82]:
ax = trips[trips.starttime.dt.month.isin([6, 7, 8])].boxplot(column='tripduration', by='from_station_name', figsize=(18,10), rot=90)
ax.set_ylim(0, 8000)
ax.set_title('Trip duration by station in June, July and August')
ax.set_xlabel('Station');
In [87]:
plt.figure(figsize=(16, 8))
ax = sns.violinplot(x=trips.starttime.dt.year, y=trips.tripduration, hue=trips.usertype, split=True)
In [98]:
water = cycle[['starttime', 'Precipitation_In', 'Mean_Humidity']].copy()
water['starttime'] = pd.to_datetime(water.starttime)
water = water[water.starttime.dt.year==2016]
water.set_index('starttime', inplace=True)
In [103]:
ax = water.resample('D').Precipitation_In.mean().plot(label='Mean Precipitation', figsize=(16, 8), legend=True)
ax = water.resample('D').Mean_Humidity.mean().plot(secondary_y=True, label='Mean Humidity', legend=True)
ax.set_xlabel('Date')
ax.set_title('Precipitation and humidity in 2016');