Homework 6

Use this notebook to work on your answers and check solutions. You can then submit your functions using "hw6_submission.ipynb" or directly write your functions in a file named "hw6_answers.py". Note that "hw6_answers.py" will be the only file collected and graded for this assignment.

For questions 1-3, you will use the APD dataset that we have been working with in class.

For questions 4-5, you will use data from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets.


In [6]:
# Loading python packages and APD data file (this step does not have to be included in hw6_answers.py)

import pandas as pd
import numpy as np

df = pd.read_csv('/home/data/APD/COBRA-YTD2017.csv.gz')

Question 1

Write a function called "variable_helper" which takes one argument:

  • df, which is a pandas data frame

and returns:

  • d, a dictionary where keys are the column names of df and values are one of "numeric", "categorical", "ordinal", "date/time", or "text", corresponding to the feature type of each column.

In [ ]:
#### play with code here #####

Sample output:

In [1]: variable_helper(df[['offense_id','beat','x','y']])
Out[1]: {'beat': 'categorical',
         'offense_id': 'ordinal',
         'x': 'numeric',
         'y': 'numeric'}

Short explanation: offense_id is a number assigned to each offense. There is a natural ordering implied in the id number (based on order of occurrence). Because of this, offense_id is an ordinal feature. The beat uses a numeric label, but refers to a geographic location. There is no natural ordering, so beat is a categorical feature. The location variables (x and y) are numeric position coordinates.

Question 2

Write a function called "get_categories" which takes one argument:

  • df, which is a pandas data frame

and returns:

  • cat, a dictionary where keys are names of columns of df corresponding to categorical features, and values are arrays of all the unique values that the feature can take.

In [ ]:
#### play with code here #####

Sample output:

In [1]: get_categories(df[['offense_id','beat','UC2 Literal']])
Out[1]: {'UC2 Literal': array(['AGG ASSAULT', 'AUTO THEFT', 'BURGLARY-NONRES',
                'BURGLARY-RESIDENCE', 'HOMICIDE', 'LARCENY-FROM VEHICLE',
                'LARCENY-NON VEHICLE', 'RAPE', 'ROBBERY-COMMERCIAL',
                'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE'], dtype=object),
         'beat': array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
                114, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212,
                213, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
                313, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
                413, 414, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
                512, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612,
                701, 702, 703, 704, 705, 706, 707, 708, 710])}

Short explanation: UC2 Literal and beat are the only categorical variables in the data frame df[['offense_id','beat','UC2 Literal']].

Question 3

Write a function called "code_shift" which takes one argument:

  • df, which is a pandas data frame

and returns:

  • a pandas data frame with columns "offense_id", "Shift", "ShiftID", where ShiftID is 0 if "Shift" is "Unk", 1 if "Morn", 2 if "Day", and 3 if "Eve".

In [ ]:
#### play with code here #####

Sample output:

In [1]: code_shift(df[:5])
Out[1]:     offense_id  Shift   ShiftID
        0   172490115   Morn    1
        1   172490265   Eve     3
        2   172490322   Morn    1
        3   172490390   Morn    1
        4   172490401   Morn    1

For the last 2 questions, you will use the cereal data file available from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets. Execute the download and loading instructions below.

Please note

  1. download URL has changed!
  2. Don't download the files if you run your notebook on the cluster! Instead load the file from /home/data/cereal/cereal.csv

In [10]:
%%sh 
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
if [ ! -f cereal.csv ]; then
    wget https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv
fi
head cereal.csv


name;mfr;type;calories;protein;fat;sodium;fiber;carbo;sugars;potass;vitamins;shelf;weight;cups;rating
String;Categorical;Categorical;Int;Int;Int;Int;Float;Float;Int;Int;Int;Int;Float;Float;Float
100% Bran;N;C;70;4;1;130;10;5;6;280;25;3;1;0.33;68.402973
100% Natural Bran;Q;C;120;3;5;15;2;8;8;135;0;3;1;1;33.983679
All-Bran;K;C;70;4;1;260;9;7;5;320;25;3;1;0.33;59.425505
All-Bran with Extra Fiber;K;C;50;4;0;140;14;8;0;330;25;3;1;0.5;93.704912
Almond Delight;R;C;110;2;2;200;1;14;8;-1;25;3;1;0.75;34.384843
Apple Cinnamon Cheerios;G;C;110;2;2;180;1.5;10.5;10;70;25;1;1;0.75;29.509541
Apple Jacks;K;C;110;2;0;125;1;11;14;30;25;2;1;1;33.174094
Basic 4;G;C;130;3;2;210;2;18;8;100;25;3;1.33;0.75;37.038562

In [ ]:


In [7]:
## RUN BUT DO NOT EDIT THIS CELL

# load the data, define ratingID
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer.head()


Out[7]:
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
0 100% Bran N C 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973
1 100% Natural Bran Q C 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679
2 All-Bran K C 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
3 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
4 Almond Delight R C 110 2 2 200 1.0 14.0 8 -1 25 3 1.0 0.75 34.384843

In [13]:
len(cer), cer.shape[0]


Out[13]:
(77, 77)

In [15]:
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<60 else 1)

# define predicted ratingID
np.random.seed(12345)
cer['predicted_ratingID'] = (cer['rating']+20*np.random.randn(cer.shape[0])).apply(lambda x: 0 if x<60 else 1)
cer.head()


Out[15]:
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating ratingID predicted_ratingID
0 100% Bran N C 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973 1 1
1 100% Natural Bran Q C 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679 0 0
2 All-Bran K C 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505 0 0
3 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912 1 1
4 Almond Delight R C 110 2 2 200 1.0 14.0 8 -1 25 3 1.0 0.75 34.384843 0 1

Question 4

Write a function called "rating_confusion" which takes one argument:

  • cer, which is a pandas data frame

and returns:

  • cf, a confusion matrix where the rows correspond to predicted_ratingID and the columns correspond to ratingID.

In [ ]:
#### play with code here #####




# Hint: look up pandas "crosstab"

Sample output:

In [1]: rating_confusion(cer[:20])
Out[1]: ratingID     0  1
predicted_ratingID      
                0   15  0
                1   3   2

Question 5

Write a function called "prediction_metrics" which takes one argument:

  • cer, which is a pandas data frame

and returns:

  • metrics_dict, a python dictionary object where the keys are 'precision', 'recall', 'F1' and the values are the numeric values for precision, recall, and F1 score, where ratingID is the prediction target and predicted_ratingID is a model output.

In [ ]:
#### play with code here #####

Sample output:

In [1]: prediction_metrics(cer[:20])
Out[1]: {'F1': 0.5714285714285715, 'precision': 0.4, 'recall': 1}