In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
First, write a function named get_cdf()
that takes an array
and returns a tuple that represents the $x$ and $y$ axes of the (empirical) CDF.
Here is a very easy algorithm to implement this function. See the definition of empirical distribution function on Wikipedia. That means we can do the following to produce the empirical CDF:
'NA'
with
numpy.nan
(Not A Number).
Use numpy.isfinite()
to mask out these missing values.numpy.sort()
to sort the array (with no missing values) in ascending order.
This will be our $x$-axis.np.arange()
to make an array
of length $N$,
and divide each element by $N$.According to Wikipedia, the resulting empirical CDF is an unbiased estimator for the true CDF.
Note: Do NOT use numpy.histogram() function to create a CDF. It uses binning, which might be useful in other cases but not in this case. The method I outlined above is a better characterization of the true CDF.
In [ ]:
#%%writefile FirstName_LastName_cdf.py
def get_cdf(filename, column):
'''
Reads a specific column of airline on-time performance CSV file,
and returns a tuple of arrays that represent the x and y axes of
cumulative distribution function.
Parameters
----------
filename(str): The filename of a CSV file.
column(str): The column header.
Returns
-------
A tuple of two numpy arrays of equal length.
The first array represents the x axis of CDF.
The second array represents the y axis of CDF.
'''
# your code goes here
return cdf_x, cdf_y
The %%writefile
magic will create a file named FirstName_LastName_cdf.py
.
Rename and submit this file along with your .ipynb
file.
Next, use the get_cdf()
function to create a CDF plot.
Your plot should show both the empirical CDF calculated
from ArrDelay
column of 2001.csv
file
and the CDF of a Guassian distribution.
You can obtain the CDF of a Guassian distribution by using
scipy.stats.norm.cdf()
.
Use loc=-3.5
and scale=10.7
(which were obtained by fitting a Gaussian curve
on the PDF).
In [ ]:
# your code goes here