Use this notebook to work on your answers and check solutions. You can then submit your functions using "hw6_submission.ipynb" or write your functions directly in a file named "hw6_answers.py". Note that "hw6_answers.py" is the only file that will be collected and graded for this assignment.
For questions 1-3, you will use the APD dataset that we have been working with in class.
For questions 4-5, you will use data from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets.
In [6]:
# Load Python packages and the APD data file (this step does not need to be included in hw6_answers.py)
import pandas as pd
import numpy as np
df = pd.read_csv('/home/data/APD/COBRA-YTD2017.csv.gz')
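Write a function called "variable_helper" which takes one argument, a data frame (as in the example call below), and returns a dictionary that maps each column name to its variable type (for example 'categorical', 'ordinal', or 'numeric').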
In [ ]:
#### play with code here #####
In [1]: variable_helper(df[['offense_id','beat','x','y']])
Out[1]: {'beat': 'categorical',
'offense_id': 'ordinal',
'x': 'numeric',
'y': 'numeric'}
Short explanation: offense_id is a number assigned to each offense. There is a natural ordering implied in the id number (based on order of occurrence). Because of this, offense_id is an ordinal feature. The beat uses a numeric label, but refers to a geographic location. There is no natural ordering, so beat is a categorical feature. The location variables (x and y) are numeric position coordinates.
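The reasoning above depends on what each column means, not only on how it is stored. As a minimal sketch (using a small made-up data frame named toy, not the APD file, and a hypothetical helper named summarize_columns), the following shows one way to inspect storage types and distinct-value counts before deciding on a variable type by hand:

```python
import pandas as pd

# Toy stand-in for the APD data; all values are made up for illustration.
toy = pd.DataFrame({
    "offense_id": [1001, 1002, 1003, 1004],
    "beat": [101, 204, 101, 607],
    "x": [-84.39, -84.41, -84.35, -84.40],
    "y": [33.75, 33.77, 33.74, 33.79],
})


def summarize_columns(frame):
    """Return the storage dtype and number of distinct values per column.

    Hypothetical helper: it only gathers evidence and does not decide
    whether a column is categorical, ordinal, or numeric.
    """
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
    })


summary = summarize_columns(toy)
print(summary)
```

In this toy frame, offense_id and beat are both stored as integers, so the dtype alone cannot separate an ordinal identifier from a categorical label. The final classification still relies on knowing what the column represents.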
In [ ]:
#### play with code here #####
In [1]: get_categories(df[['offense_id','beat','UC2 Literal']])
Out[1]: {'UC2 Literal': array(['AGG ASSAULT', 'AUTO THEFT', 'BURGLARY-NONRES',
'BURGLARY-RESIDENCE', 'HOMICIDE', 'LARCENY-FROM VEHICLE',
'LARCENY-NON VEHICLE', 'RAPE', 'ROBBERY-COMMERCIAL',
'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE'], dtype=object),
'beat': array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
114, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212,
213, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
313, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
413, 414, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511,
512, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612,
701, 702, 703, 704, 705, 706, 707, 708, 710])}
Short explanation: UC2 Literal and beat are the only categorical variables in the data frame df[['offense_id','beat','UC2 Literal']].
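The output above pairs each categorical column with a sorted array of its distinct values. A minimal sketch of that idea on a made-up data frame follows. It assumes the categorical columns are already known and listed by hand (the name categorical_columns is hypothetical), and it is not necessarily how get_categories is expected to be written:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the APD data; all values are made up for illustration.
toy = pd.DataFrame({
    "offense_id": [1001, 1002, 1003, 1004],
    "beat": [204, 101, 101, 607],
    "UC2 Literal": ["ROBBERY-PEDESTRIAN", "AUTO THEFT", "AUTO THEFT", "HOMICIDE"],
})

# Assumption: the categorical columns are chosen by hand for this sketch.
categorical_columns = ["beat", "UC2 Literal"]

# np.unique returns the distinct values in sorted order; dropna() guards
# against missing values, which cannot be sorted alongside strings.
categories = {
    col: np.unique(toy[col].dropna().to_numpy())
    for col in categorical_columns
}

print(categories)
```

Because offense_id is not in the hand-picked list, it does not appear in the resulting dictionary, which mirrors the example output above.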
In [ ]:
#### play with code here #####
For the last 2 questions, you will use the cereal data file available from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets. Execute the download and loading instructions below.
Please note
/home/data/cereal/cereal.csv
In [10]:
%%sh
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
if [ ! -f cereal.csv ]; then
wget https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv
fi
head cereal.csv
In [7]:
## RUN BUT DO NOT EDIT THIS CELL
# load the data (skiprows=[1] skips the second line of the file; fields are separated by ';')
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer.head()
Out[7]:
In [13]:
len(cer), cer.shape[0]
Out[13]:
In [15]:
# define ratingID: 0 if rating is below 60, otherwise 1
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<60 else 1)
# define predicted ratingID
np.random.seed(12345)
cer['predicted_ratingID'] = (cer['rating']+20*np.random.randn(cer.shape[0])).apply(lambda x: 0 if x<60 else 1)
cer.head()
Out[15]:
In [ ]:
#### play with code here #####
# Hint: look up pandas "crosstab"
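As a hedged illustration of the crosstab hint, the sketch below cross-tabulates two binary columns in a small made-up data frame. The column names mirror ratingID and predicted_ratingID from the cells above, but the values are invented and the cereal file is not used:

```python
import pandas as pd

# Toy stand-in for the cereal data; all values are made up for illustration.
toy = pd.DataFrame({
    "ratingID":           [0, 0, 0, 1, 1, 0, 1, 0],
    "predicted_ratingID": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Rows are the actual labels, columns are the predicted labels;
# each cell counts how many rows have that (actual, predicted) pair.
table = pd.crosstab(toy["ratingID"], toy["predicted_ratingID"])
print(table)

# Individual counts can be read off with .loc[actual, predicted].
correct = table.loc[0, 0] + table.loc[1, 1]
print("correctly predicted rows:", correct)
```

Here the diagonal cells hold the rows where the prediction matches the actual label, and the off-diagonal cells hold the two kinds of mismatch.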
Write a function called "prediction_metrics" which takes one argument:
and returns:
In [ ]:
#### play with code here #####