:mod:diogenes.modify provides tools for manipulating arrays and generating features.
* diogenes.modify.modify.replace_missing_vals
* diogenes.modify.modify.label_encode
* diogenes.modify.modify.choose_cols_where
* diogenes.modify.modify.choose_rows_where
* diogenes.modify.modify.remove_cols_where
* diogenes.modify.modify.remove_rows_where
* diogenes.modify.modify.generate_bin
* diogenes.modify.modify.normalize
* diogenes.modify.modify.combine_cols
* diogenes.modify.modify.distance_from_point
* diogenes.modify.modify.where_all_are_true
Diogenes provides two functions for data cleaning:
* diogenes.modify.modify.replace_missing_vals, which replaces missing values with valid ones.
* diogenes.modify.modify.label_encode, which replaces strings with corresponding integers.
For this example, we'll look at Chicago's "311 Service Requests - Tree Debris" data on the Chicago data portal (https://data.cityofchicago.org/).
In [51]:
import diogenes
data = diogenes.read.open_csv_url('https://data.cityofchicago.org/api/views/mab8-y9h3/rows.csv?accessType=DOWNLOAD',
                                  parse_datetimes=['Creation Date', 'Completion Date'])
The last row of this data set repeats the labels. We're going to go ahead and omit it.
In [52]:
data = data[:-1]
In [53]:
data.dtype
Out[53]:
We're going to predict whether a job is still open, so our label will ultimately be the "Status" column.
In [54]:
from collections import Counter
print Counter(data['Status']).most_common()
We'll remove the label from the rest of the data later. First, let's do some cleaning. Notice that some of our floating point variables have missing data (encoded as numpy.nan):
In [55]:
import numpy as np
print sum(np.isnan(data['ZIP Code']))
print sum(np.isnan(data['Ward']))
print sum(np.isnan(data['X Coordinate']))
Scikit-learn can't tolerate these missing values, so we have to do something with them. The statistically sound choice for this data would probably be to leave these rows out, but for pedagogical purposes, let's assume it makes sense to impute the data. We can do that with :func:diogenes.modify.modify.replace_missing_vals.
We could, for instance, replace every nan with a 0:
In [56]:
data_with_zeros = diogenes.modify.replace_missing_vals(data, strategy='constant', constant=0)
print sum(np.isnan(data_with_zeros['ZIP Code']))
print sum(data_with_zeros['ZIP Code'] == 0)
Looks like there were a few entries that had 0 for a zip code already.
For the purposes of this tutorial, we will go ahead and replace missing values with the most frequent value in the column:
In [57]:
data = diogenes.modify.replace_missing_vals(data, strategy='most_frequent')
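To make the "most frequent" strategy concrete, here is a minimal NumPy sketch of the idea applied to a single float column. The helper name fill_nan_most_frequent and the toy ZIP codes are made up for illustration; this is not Diogenes's implementation, only one plausible way to express the same imputation.

```python
import numpy as np
from collections import Counter

def fill_nan_most_frequent(col):
    """Return a copy of a float column with NaNs replaced by its most common non-NaN value."""
    col = np.asarray(col, dtype=float)
    present = col[~np.isnan(col)]
    # most_common(1) returns [(value, count)] for the most frequent value
    most_common_value = Counter(present.tolist()).most_common(1)[0][0]
    filled = col.copy()
    filled[np.isnan(filled)] = most_common_value
    return filled

# Toy data (hypothetical values, not taken from the Chicago data set)
zips = np.array([60614.0, np.nan, 60614.0, 60647.0, np.nan])
filled = fill_nan_most_frequent(zips)
print(filled)
```

Unlike filling with a constant such as 0, this never introduces a value that was absent from the column, but it does inflate the count of the most common value.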
Our data also has a number of string columns. Strings must be converted to numbers before Scikit-learn can analyze them, so we will use :func:diogenes.modify.modify.label_encode to convert them:
In [58]:
print Counter(data['If Yes, where is the debris located?']).most_common()
data, classes = diogenes.modify.label_encode(data)
print Counter(data['If Yes, where is the debris located?']).most_common()
print classes['If Yes, where is the debris located?']
Note that classes is a dictionary of arrays where each key is the column name and each value is an array of which string each number represents. For example, if we wanted to find out what category 1 represents, we would look at:
In [59]:
classes['If Yes, where is the debris located?'][1]
Out[59]:
and find that category 1 is 'Alley'.
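The relationship between the integer codes and the classes array can be sketched with plain NumPy. The toy column below is invented, and the assumption that classes are numbered in sorted order is ours; it happens to reproduce the numbering seen in this tutorial (1 for 'Alley', 2 for 'Parkway') if an empty string is also present, but Diogenes's actual ordering is not confirmed here.

```python
import numpy as np

# Toy column (hypothetical values); the empty string stands in for a blank entry
locations = np.array(['Parkway', 'Alley', '', 'Parkway'])

# np.unique returns the sorted distinct strings and, with return_inverse=True,
# the position of each original entry within that sorted array
class_names, codes = np.unique(locations, return_inverse=True)

print(codes)              # integer code for each row
print(class_names[1])     # the string that code 1 stands for
```

Indexing class_names with the codes recovers the original strings, which is why keeping the classes dictionary around is enough to decode any encoded column later.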
Diogenes provides a number of functions to retain only the columns and rows that match specific criteria:
* diogenes.modify.modify.choose_cols_where
* diogenes.modify.modify.remove_cols_where
* diogenes.modify.modify.choose_rows_where
* diogenes.modify.modify.remove_rows_where
These are explained in detail in the module documentation for :mod:diogenes.modify.modify. Explaining all the different things you can do with these selection operators is outside the scope of this tutorial.
We'll start out by removing any columns for which every row is the same value by employing the :func:diogenes.modify.modify.col_val_eq_any column selection function:
In [60]:
print data.dtype.names
print
print Counter(data['Type of Service Request'])
print
arguments = [{'func': diogenes.modify.col_val_eq_any, 'vals': None}]
data = diogenes.modify.remove_cols_where(data, arguments)
print data.dtype.names
Notice that "Type of Service Request" has been removed, since every value in the column was the same.
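For intuition, dropping constant columns from a NumPy structured array can be sketched without Diogenes as follows. The array and its field names are invented for the example, and this illustrates the idea only, not how remove_cols_where is implemented.

```python
import numpy as np

# Toy structured array (hypothetical fields); 'kind' holds the same value in every row
M = np.array([(1, 7, 2.5), (2, 7, 2.5), (3, 7, 4.0)],
             dtype=[('id', int), ('kind', int), ('score', float)])

# Keep a column only if at least one entry differs from the column's first entry
keep = [name for name in M.dtype.names if not np.all(M[name] == M[name][0])]
M_reduced = M[keep]

print(M_reduced.dtype.names)
```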
Next, let's assume that we're only interested in requests made during the year 2015 and select only those rows using the :func:diogenes.modify.modify.row_val_between row selection function:
In [61]:
print data.shape
print data['Creation Date'].min()
print data['Creation Date'].max()
print
arguments = [{'func': diogenes.modify.row_val_between,
              'vals': [np.datetime64('2015-01-01T00:00:00', 'ns'), np.datetime64('2016-01-01T00:00:00', 'ns')],
              'col_name': 'Creation Date'}]
data = diogenes.modify.choose_rows_where(data, arguments)
print data.shape
print data['Creation Date'].min()
print data['Creation Date'].max()
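The same kind of date-range selection can be written directly as a boolean mask over a datetime64 column. The timestamps below are made up, and the half-open interval (start included, end excluded) is a choice made for this sketch; how row_val_between treats its endpoints is not stated in this tutorial.

```python
import numpy as np

# Toy timestamps (hypothetical), one before, one inside, and one after 2015
created = np.array(['2014-12-31T23:00:00',
                    '2015-03-15T08:30:00',
                    '2016-01-02T00:00:00'], dtype='datetime64[ns]')

start = np.datetime64('2015-01-01T00:00:00', 'ns')
end = np.datetime64('2016-01-01T00:00:00', 'ns')

# True only for rows created on or after start and strictly before end
in_2015 = (created >= start) & (created < end)
print(in_2015)
```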
Finally, let's remove rows which the "Status" column claims are duplicates. We review our classes variable to find:
In [62]:
classes['Status']
Out[62]:
We want to remove rows that have either 1 or 3 in the status column. We don't have a row selection function already defined to select rows that have one of several discrete values, so we will create one:
In [63]:
def row_val_in(M, col_name, vals):
    # True for rows whose value in col_name equals either of the two entries in vals
    return np.logical_or(M[col_name] == vals[0], M[col_name] == vals[1])
print data.shape
print Counter(data['Status']).most_common()
print
arguments = [{'func': row_val_in, 'vals': [1, 3], 'col_name': 'Status'}]
data2 = diogenes.modify.remove_rows_where(data, arguments)
print data2.shape
print Counter(data2['Status']).most_common()
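The row_val_in function above compares against exactly two values. A variant that accepts any number of values can be sketched with np.isin while keeping the same (M, col_name, vals) signature. The name row_val_in_any and the toy array are hypothetical, and the sketch assumes, as in the example above, that a row selection function returns one boolean per row.

```python
import numpy as np

def row_val_in_any(M, col_name, vals):
    """Boolean mask that is True where M[col_name] equals any entry of vals."""
    return np.isin(M[col_name], vals)

# Toy structured array (hypothetical status codes)
M = np.array([(0,), (1,), (2,), (3,), (1,)], dtype=[('Status', int)])
mask = row_val_in_any(M, 'Status', [1, 3])
print(mask)
```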
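Next, we'll generate some features. First, we compute each request's distance from Cloud Gate (latitude 41.882773, longitude -87.623304) using :func:diogenes.modify.modify.distance_from_point:
In [65]: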
dist_from_cloud_gate = diogenes.modify.distance_from_point(41.882773, -87.623304, data['Latitude'], data['Longitude'])
print dist_from_cloud_gate[:10]
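One common way to compute the distance between latitude/longitude points is the haversine (great-circle) formula, sketched below. Whether distance_from_point uses this formula, and which units it returns, is not stated in this tutorial, so treat the helper haversine_km and its kilometre output as assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat0, lon0, lats, lons):
    """Great-circle distance in km from (lat0, lon0) to each (lat, lon) pair, in degrees."""
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    a = (np.sin((lats - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Distance from the reference point to itself and to a point one degree north
d = haversine_km(41.882773, -87.623304,
                 [41.882773, 42.882773], [-87.623304, -87.623304])
print(d)
```

The second distance comes out near 111 km, the familiar length of one degree of latitude, which is a quick sanity check on the formula.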
Now we'll put those distances into 10 bins using :func:diogenes.modify.modify.generate_bin.
In [66]:
dist_binned = diogenes.modify.generate_bin(dist_from_cloud_gate, 10)
print dist_binned[:10]
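Binning turns a continuous feature into a small number of integer categories. The sketch below uses equal-width bins between the minimum and maximum; this is an assumption chosen for illustration, since this tutorial does not say how generate_bin places its bin edges. The helper equal_width_bins is hypothetical.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Assign each value an integer label from 0 to n_bins - 1 using equal-width bins."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Compare against the interior edges only, so that labels start at 0
    # and the maximum value falls into the last bin
    return np.digitize(values, edges[1:-1])

binned = equal_width_bins([0.0, 2.5, 5.0, 9.9, 10.0], 4)
print(binned)
```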
Now we'll make a binary feature that is true if and only if the tree is in a parkway in ward 10 using :func:diogenes.modify.modify.where_all_are_true (which has similar syntax to the selection functions).
In [67]:
print classes['If Yes, where is the debris located?']
We note that "Parkway" is category 2, so we will select items that equal 2 in the "If Yes, where is the debris located?" column and 10 in the "Ward" column.
In [68]:
arguments = [{'func': diogenes.modify.row_val_eq,
              'col_name': 'If Yes, where is the debris located?',
              'vals': 2},
             {'func': diogenes.modify.row_val_eq,
              'col_name': 'Ward',
              'vals': 10}]
parkway_in_ward_10 = diogenes.modify.where_all_are_true(data, arguments)
print np.where(parkway_in_ward_10)
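Conceptually, a feature that is true only when every condition holds is the element-wise AND of one boolean mask per condition. A minimal NumPy sketch with an invented structured array follows; the field name 'location' is hypothetical, and this shows the idea rather than Diogenes's own code.

```python
import numpy as np

# Toy structured array (hypothetical values): location code and ward number
M = np.array([(2, 10), (2, 4), (1, 10), (2, 10)],
             dtype=[('location', int), ('Ward', int)])

# One boolean mask per condition
masks = [M['location'] == 2, M['Ward'] == 10]

# Element-wise AND across all masks
all_true = np.logical_and.reduce(masks)
print(np.where(all_true)[0])
```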
Finally, we'll add all of our generated features to our data using :func:diogenes.utils.append_cols:
In [70]:
data = diogenes.utils.append_cols(data, [dist_from_cloud_gate, dist_binned, parkway_in_ward_10],
                                  ['dist_from_cloud_gate', 'dist_binned', 'parkway_in_ward_10'])
print data.dtype
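For readers working with structured arrays outside Diogenes, NumPy's own numpy.lib.recfunctions.append_fields performs a comparable operation. The sketch below uses an invented array and column; it is offered as a point of comparison and makes no claim about how append_cols is implemented.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy structured array (hypothetical fields) and a new column of bin labels
M = np.array([(1, 2.0), (2, 3.5)], dtype=[('id', int), ('dist', float)])
dist_binned = np.array([0, 1])

# usemask=False returns a plain structured ndarray rather than a masked array
M_ext = rfn.append_fields(M, 'dist_binned', dist_binned, usemask=False)
print(M_ext.dtype.names)
```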
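With the features in place, we separate the label ("Status") from the rest of the data and run an experiment:
In [77]: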
labels = data['Status']
M = diogenes.utils.remove_cols(data, ['Status', 'Completion Date'])
exp = diogenes.grid_search.experiment.Experiment(M, labels)
exp.run()
Out[77]: