Homework 6

Use this notebook to work on your answers and check solutions. You can then submit your functions using "hw6_submission.ipynb" or directly write your functions in a file named "hw6_answers.py". Note that "hw6_answers.py" will be the only file collected and graded for this assignment.

For questions 1-3, you will use the APD dataset that we have been working with in class.

For questions 4-5, you will use data from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets.



In [6]:

    
# Loading python packages and APD data file (this step does not have to be included in hw6_answers.py)

import pandas as pd
import numpy as np

df = pd.read_csv('/home/data/APD/COBRA-YTD2017.csv.gz')

Question 1

Write a function called "variable_helper" which takes one argument:

df, which is a pandas data frame

and returns:

d, a dictionary where keys are the column names of df and values are one of "numeric", "categorical", "ordinal", "date/time", or "text", corresponding to the feature type of each column.



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: variable_helper(df[['offense_id','beat','x','y']])
Out[1]: {'beat': 'categorical',
         'offense_id': 'ordinal',
         'x': 'numeric',
         'y': 'numeric'}

Short explanation: offense_id is a number assigned to each offense. There is a natural ordering implied in the id number (based on order of occurrence). Because of this, offense_id is an ordinal feature. The beat uses a numeric label, but refers to a geographic location. There is no natural ordering, so beat is a categorical feature. The location variables (x and y) are numeric position coordinates.

Question 2

Write a function called "get_categories" which takes one argument:

df, which is a pandas data frame

and returns:

cat, a dictionary where keys are names of columns of df corresponding to categorical features, and values are arrays of all the unique values that the feature can take.



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: get_categories(df[['offense_id','beat','UC2 Literal']])
Out[1]: {'UC2 Literal': array(['AGG ASSAULT', 'AUTO THEFT', 'BURGLARY-NONRES',
                'BURGLARY-RESIDENCE', 'HOMICIDE', 'LARCENY-FROM VEHICLE',
                'LARCENY-NON VEHICLE', 'RAPE', 'ROBBERY-COMMERCIAL',
                'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE'], dtype=object),
         'beat': array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
                114, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212,
                213, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
                313, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
                413, 414, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
                512, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612,
                701, 702, 703, 704, 705, 706, 707, 708, 710])}

Short explanation: UC2 Literal and beat are the only categorical variables in the data frame df[['offense_id','beat','UC2 Literal']].

Question 3

Write a function called "code_shift" which takes one argument:

df, which is a pandas data frame

and returns:

a pandas data frame with columns "offense_id", "Shift", "ShiftID", where ShiftID is 0 if "Shift" is "Unk", 1 if "Morn", 2 if "Day", and 3 if "Eve".



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: code_shift(df[:5])
Out[1]:     offense_id  Shift   ShiftID
        0   172490115   Morn    1
        1   172490265   Eve     3
        2   172490322   Morn    1
        3   172490390   Morn    1
        4   172490401   Morn    1

For the last 2 questions, you will use the cereal data file available from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets. Execute the download and loading instructions below.

Please note

download URL has changed!
Don't download the files if you run your notebook on the cluster! Instead load the file from /home/data/cereal/cereal.csv



In [10]:

    
%%sh 
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
if [ ! -f cereal.csv ]; then
    wget https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv
fi
head cereal.csv









    



name;mfr;type;calories;protein;fat;sodium;fiber;carbo;sugars;potass;vitamins;shelf;weight;cups;rating
String;Categorical;Categorical;Int;Int;Int;Int;Float;Float;Int;Int;Int;Int;Float;Float;Float
100% Bran;N;C;70;4;1;130;10;5;6;280;25;3;1;0.33;68.402973
100% Natural Bran;Q;C;120;3;5;15;2;8;8;135;0;3;1;1;33.983679
All-Bran;K;C;70;4;1;260;9;7;5;320;25;3;1;0.33;59.425505
All-Bran with Extra Fiber;K;C;50;4;0;140;14;8;0;330;25;3;1;0.5;93.704912
Almond Delight;R;C;110;2;2;200;1;14;8;-1;25;3;1;0.75;34.384843
Apple Cinnamon Cheerios;G;C;110;2;2;180;1.5;10.5;10;70;25;1;1;0.75;29.509541
Apple Jacks;K;C;110;2;0;125;1;11;14;30;25;2;1;1;33.174094
Basic 4;G;C;130;3;2;210;2;18;8;100;25;3;1.33;0.75;37.038562



In [ ]:



In [7]:

    
## RUN BUT DO NOT EDIT THIS CELL

# load the data, define ratingID
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer.head()









    Out[7]:






  
    
      
      name
      mfr
      type
      calories
      protein
      fat
      sodium
      fiber
      carbo
      sugars
      potass
      vitamins
      shelf
      weight
      cups
      rating
    
  
  
    
      0
      100% Bran
      N
      C
      70
      4
      1
      130
      10.0
      5.0
      6
      280
      25
      3
      1.0
      0.33
      68.402973
    
    
      1
      100% Natural Bran
      Q
      C
      120
      3
      5
      15
      2.0
      8.0
      8
      135
      0
      3
      1.0
      1.00
      33.983679
    
    
      2
      All-Bran
      K
      C
      70
      4
      1
      260
      9.0
      7.0
      5
      320
      25
      3
      1.0
      0.33
      59.425505
    
    
      3
      All-Bran with Extra Fiber
      K
      C
      50
      4
      0
      140
      14.0
      8.0
      0
      330
      25
      3
      1.0
      0.50
      93.704912
    
    
      4
      Almond Delight
      R
      C
      110
      2
      2
      200
      1.0
      14.0
      8
      -1
      25
      3
      1.0
      0.75
      34.384843



In [13]:

    
len(cer), cer.shape[0]









    Out[13]:





(77, 77)



In [15]:

    
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<60 else 1)

# define predicted ratingID
np.random.seed(12345)
cer['predicted_ratingID'] = (cer['rating']+20*np.random.randn(cer.shape[0])).apply(lambda x: 0 if x<60 else 1)
cer.head()









    Out[15]:






  
    
      
      name
      mfr
      type
      calories
      protein
      fat
      sodium
      fiber
      carbo
      sugars
      potass
      vitamins
      shelf
      weight
      cups
      rating
      ratingID
      predicted_ratingID
    
  
  
    
      0
      100% Bran
      N
      C
      70
      4
      1
      130
      10.0
      5.0
      6
      280
      25
      3
      1.0
      0.33
      68.402973
      1
      1
    
    
      1
      100% Natural Bran
      Q
      C
      120
      3
      5
      15
      2.0
      8.0
      8
      135
      0
      3
      1.0
      1.00
      33.983679
      0
      0
    
    
      2
      All-Bran
      K
      C
      70
      4
      1
      260
      9.0
      7.0
      5
      320
      25
      3
      1.0
      0.33
      59.425505
      0
      0
    
    
      3
      All-Bran with Extra Fiber
      K
      C
      50
      4
      0
      140
      14.0
      8.0
      0
      330
      25
      3
      1.0
      0.50
      93.704912
      1
      1
    
    
      4
      Almond Delight
      R
      C
      110
      2
      2
      200
      1.0
      14.0
      8
      -1
      25
      3
      1.0
      0.75
      34.384843
      0
      1

Question 4

Write a function called "rating_confusion" which takes one argument:

cer, which is a pandas data frame

and returns:

cf, a confusion matrix where the rows correspond to predicted_ratingID and the columns correspond to ratingID.



In [ ]:

    
#### play with code here #####




# Hint: look up pandas "crosstab"

Sample output:

In [1]: rating_confusion(cer[:20])
Out[1]: ratingID     0  1
predicted_ratingID      
                0   15  0
                1   3   2

Question 5

Write a function called "prediction_metrics" which takes one argument:

cer, which is a pandas data frame

and returns:

metrics_dict, a python dictionary object where the keys are 'precision', 'recall', 'F1' and the values are the numeric values for precision, recall, and F1 score, where ratingID is the prediction target and predicted_ratingID is a model output.



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: prediction_metrics(cer[:20])
Out[1]: {'F1': 0.5714285714285715, 'precision': 0.4, 'recall': 1}

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	C	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	C	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	C	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	C	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	C	110	2	2	200	1.0	14.0	8	-1	25	3	1.0	0.75	34.384843