This problem is a continuation of problem 4.1. Recall that you wrote a function named get_stats()
that takes a list and returns a tuple of minimum, maximum, mean, and median. To use this function, you have to convert your IPython notebook to a regular .py
file. One way to do this is to use the IPython %%script%%
magic function; open up a new notebook, and in an IPython notebook cell, type (assuming the filename of your IPython notebook from Problem 4.1 is stats.ipynb
):
%%bash
ipython3 nbconvert --to python stats.ipynb
and press shift + enter. This will create a Python script
named stats.py
. We will import this as a module in stats2.ipynb
:
In [ ]:
from stats import get_stats
We will use the function get_stats()
to compute basic statistics of a number of columns from the arline performance dataset we downloaded in week 2. Namely, we will use the following columns:
To extract these columns from the CSV file,
Write a function named get_column(filename, n, header = True)
that reads the n
-th column from a file and returns a list of integers.
You may assume that the column is made of integers.
We will also use the optional argument header
because the first line of our file lists the names of the columns, but we might want to turn this off to handle a file that doesn't have a header.
Use a combination of with
statement and open()
function to open filename
in the get_column()
function.
Tip: When I tried to use open()
to read 2001.csv
, I had the following error:
'utf-8' codec can't decode byte 0xe4 in position 343: invalid continuation byte
You can avoid this error by using encoding='latin-1'
option in open()
.
Skip the first line if the header
parameter is True
; do not skip if it's False
.
Some columns have missing values 'NA'
, and you need a way to handle these
missing values. If the n
-th column is missing, you should not include
that column in result
; that is, skip all rows with 'NA'
.
As a result, lists returned from different
columns may have different lengths.
In [ ]:
def get_column(filename, n, header=True):
'''
Returns a list from reading the specified column in the CSV file.
Parameters
__________
filename(str): Input file name in Comma Separated Values (CSV) format
n(int): Column number. The first column starts at 0. The column must be
a list of integers.
header(bool): If True, the first line of file is column names.
Default: True.
Examples
________
>>> get_column('/data/airline/2001.csv', 14)[:10]
[-3, 4, 23, 10, 20, -3, -10, -12, -9, -1]
>>> get_column('/data/airline/2001.csv', 15)[-10:]
[-4, -5, -8, 4, -7, 4, 8, -4, -4, 9]
'''
result = []
# your code goes here
return result
We also want to print out the results in a nicely formatted manner.
print_stats(input_list, title=None)
function is already written for you.
You don't need to write this function.It takes a list of integers and prints out the basic statistics.
In [ ]:
def print_stats(input_list, title=None):
'''
Computes minimum, maximum, mean, and median using get_stats function from
stats module, and prints them out in a nice format.
Parameters:
input_list(list): a list representing a column
title(str): Optional. If given, title is printed out before the stats.
Examples:
>>> print_stats(list(range(50)))
Minimum: 0
Maximum: 49
Mean: 24.5
Median: 24.5
>>> print_stats(list(range(100)), title = 'Stats!')
Stats!
Minimum: 0
Maximum: 99
Mean: 49.5
Median: 49.5
'''
if title is not None:
print(title)
minimum, maximum, mean, median = get_stats(input_list)
print('Minimum: {0}\n'
'Maximum: {1}\n'
'Mean: {2:.2f}\n'
'Median: {3:.2f}'.format(minimum, maximum, mean, median))
return None
When you run the following cell, you should get
Arrival delay, in minutes.
Minimum: -1116
Maximum: 1688
Mean: 5.53
Median: -2.00
In [ ]:
#warning: this could take a while.
filename = '/data/airline/2001.csv' # 2001 airline on-time performance dataset
arr_delay = get_column(filename, 14)
print_stats(arr_delay, "Arrival delay, in minutes.")
When you run the following cell, you should get
Departure delay, in minutes.
Minimum: -204
Maximum: 1692
Mean: 8.15
Median: 0.00
In [ ]:
#warning: this could take a while.
dep_delay = get_column(filename, 15)
print_stats(dep_delay, "Departure delay, in minutes.")
When you run the following cell, you should get
Distance, in miles.
Minimum: 21
Maximum: 4962
Mean: 733.03
Median: 571.00
In [ ]:
#warning: this could take a while.
distance = get_column(filename, 18)
print_stats(distance, "Distance, in miles.")
In [ ]: