Problem 4.2. Simple Statistics II.

This problem is a continuation of Problem 4.1. Recall that you wrote a function named get_stats() that takes a list and returns a tuple of minimum, maximum, mean, and median. To use this function here, you first have to convert your IPython notebook to a regular .py file. One way to do this is with the IPython %%bash cell magic: open a new notebook and, in a notebook cell, type the following (assuming the filename of your IPython notebook from Problem 4.1 is stats.ipynb):

%%bash
ipython3 nbconvert --to python stats.ipynb

and press Shift + Enter. This creates a Python script named stats.py, which we import as a module in stats2.ipynb:


In [ ]:
from stats import get_stats
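
If stats.py from Problem 4.1 is not at hand, a stand-in with the same return shape can keep the rest of this notebook runnable. The sketch below is only an assumption about what get_stats() does, based on the description above (minimum, maximum, mean, and median, in that order); it is not necessarily how your Problem 4.1 solution was written.

```python
import statistics


def get_stats(input_list):
    """Stand-in for the Problem 4.1 function (an assumption, not the original).

    Returns a tuple (minimum, maximum, mean, median) for a non-empty list
    of numbers.
    """
    minimum = min(input_list)
    maximum = max(input_list)
    mean = sum(input_list) / len(input_list)
    median = statistics.median(input_list)  # averages the two middle values for even lengths
    return minimum, maximum, mean, median


print(get_stats(list(range(50))))
```

If you use a stand-in like this, remove the import of get_stats from stats so the two definitions do not shadow each other.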

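We will use the function get_stats() to compute basic statistics of a number of columns from the airline performance dataset we downloaded in week 2. Namely, we will use the following columns (numbered from 1 here; get_column() counts from 0, so the calls later in this notebook use 14, 15, and 18):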

  • Column 15, "ArrDelay": arrival delay, in minutes,
  • Column 16, "DepDelay": departure delay, in minutes, and
  • Column 19, "Distance": distance, in miles.
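
One way to double-check these positions is to look the names up in the header line. The sketch below hard-codes the header as it is commonly published for this dataset, which is an assumption; compare it against the first line of your own 2001.csv before relying on it.

```python
# Header line as commonly published for the airline on-time dataset.
# ASSUMPTION: it is typed in here, not read from 2001.csv.
header_line = (
    'Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,'
    'UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,'
    'ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,'
    'CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,'
    'SecurityDelay,LateAircraftDelay'
)

names = header_line.split(',')
for name in ('ArrDelay', 'DepDelay', 'Distance'):
    index = names.index(name)  # 0-based position in the row
    print('{0}: column {1}, index {2}'.format(name, index + 1, index))
```

The 1-based column number is what the list above uses; the 0-based index is what a Python list lookup such as fields[index] expects.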

To extract these columns from the CSV file,

  • Write a function named get_column(filename, n, header=True) that reads the n-th column (counting from 0) from a file and returns a list of integers.

  • You may assume that the column is made of integers.

  • We will also use the optional argument header because the first line of our file lists the names of the columns, but we might want to turn this off to handle a file that doesn't have a header.

  • Use a with statement together with the open() function to open filename inside get_column().

    Tip: When I tried to use open() to read 2001.csv, I got the following error:

      'utf-8' codec can't decode byte 0xe4 in position 343: invalid continuation byte

    You can avoid this error by passing encoding='latin-1' to open().

  • Skip the first line if the header parameter is True; do not skip if it's False.

  • Some columns contain the missing-value marker 'NA', and you need a way to handle it. If the value in the n-th column of a row is 'NA', do not include that value in the result; that is, skip every row whose n-th column is 'NA'. As a result, lists returned for different columns may have different lengths.
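
The requirements above can be sketched on a tiny made-up file. This is one possible shape, not the reference solution: the helper name read_int_column and the sample data are hypothetical, and the naive split(',') assumes that no field contains a quoted comma (the csv module would be the safer choice if that assumption does not hold).

```python
import os
import tempfile


def read_int_column(filename, n, header=True):
    """Hypothetical helper: read column n (0-based) as integers, skipping 'NA'."""
    result = []
    with open(filename, encoding='latin-1') as f:
        if header:
            next(f, None)  # discard the line of column names, if there is one
        for line in f:
            fields = line.rstrip('\n').split(',')
            value = fields[n]
            if value == 'NA':
                continue  # missing value: leave this row out of the result
            result.append(int(value))
    return result


# Made-up sample: the second column has one 'NA', and the first data row
# contains the character 0xe4, which is valid latin-1 but not valid UTF-8 here.
sample = 'Name,Delay\nM\xe4rz,5\nApril,NA\nMay,-3\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='latin-1') as tmp:
    tmp.write(sample)

values = read_int_column(tmp.name, 1)
os.remove(tmp.name)  # clean up the temporary file
print(values)
```

Because latin-1 maps every possible byte to a character, decoding with it never raises the error shown in the tip, although text that was really written in another encoding may display incorrectly.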


In [ ]:
def get_column(filename, n, header=True):
    '''
    Returns a list from reading the specified column in the CSV file.

    Parameters
    __________
    filename(str): Input file name in Comma Separated Values (CSV) format
    n(int): Column number. The first column starts at 0. The column must be
            a list of integers.
    header(bool): If True, the first line of file is column names.
                  Default: True.

    Examples
    ________
    >>> get_column('/data/airline/2001.csv', 14)[:10]
    [-3, 4, 23, 10, 20, -3, -10, -12, -9, -1]
    >>> get_column('/data/airline/2001.csv', 15)[-10:]
    [-4, -5, -8, 4, -7, 4, 8, -4, -4, 9]
    '''
    result = []
    
    # your code goes here
    
    return result

We also want to print out the results in a nicely formatted manner.

  • The print_stats(input_list, title=None) function is already written for you. You don't need to write this function.

It takes a list of integers and prints out the basic statistics.


In [ ]:
def print_stats(input_list, title=None):
    '''
    Computes minimum, maximum, mean, and median using get_stats function from
      stats module, and prints them out in a nice format.

    Parameters:
      input_list(list): a list representing a column
      title(str): Optional. If given, title is printed out before the stats.

    Examples:
    >>> print_stats(list(range(50)))
    Minimum: 0
    Maximum: 49
    Mean: 24.50
    Median: 24.50
    >>> print_stats(list(range(100)), title='Stats!')
    Stats!
    Minimum: 0
    Maximum: 99
    Mean: 49.50
    Median: 49.50
    '''
    if title is not None:
        print(title)
        
    minimum, maximum, mean, median = get_stats(input_list)
    print('Minimum: {0}\n'
          'Maximum: {1}\n'
          'Mean: {2:.2f}\n'
          'Median: {3:.2f}'.format(minimum, maximum, mean, median))
    return None
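
In the format string, the {2:.2f} and {3:.2f} fields always print the mean and median with exactly two digits after the decimal point, while {0} and {1} print the minimum and maximum unchanged. A small illustration with made-up values:

```python
# Made-up values, only to show how the format fields behave.
line = 'Minimum: {0}, Mean: {1:.2f}, Median: {2:.2f}'.format(0, 24.5, 7)
print(line)  # Minimum: 0, Mean: 24.50, Median: 7.00
```

Note that the .2f specification also accepts integers, so an integer median is still printed with two decimal places.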

When you run the following cell, you should get:

Arrival delay, in minutes.
Minimum: -1116
Maximum: 1688
Mean: 5.53
Median: -2.00

In [ ]:
#warning: this could take a while.
filename = '/data/airline/2001.csv' # 2001 airline on-time performance dataset

arr_delay = get_column(filename, 14)
print_stats(arr_delay, "Arrival delay, in minutes.")

When you run the following cell, you should get:

Departure delay, in minutes.
Minimum: -204
Maximum: 1692
Mean: 8.15
Median: 0.00

In [ ]:
#warning: this could take a while.
dep_delay = get_column(filename, 15)
print_stats(dep_delay, "Departure delay, in minutes.")

When you run the following cell, you should get:

Distance, in miles.
Minimum: 21
Maximum: 4962
Mean: 733.03
Median: 571.00

In [ ]:
#warning: this could take a while.
distance = get_column(filename, 18)
print_stats(distance, "Distance, in miles.")
