So far, we've learned about splitting our data into training and testing sets to validate our models. This helps ensure that the model we create on one sample performs well on another sample we want to predict.
However, we don't have to use just TWO samples to train and test our models. Instead, we can split our data up into MULTIPLE samples to train and test on multiple segments of the data. This is called CROSS-VALIDATION. This allows us to ensure that our model predicts outcomes over a wider range of circumstances.
Let's begin by importing our packages.
In [1]:
! conda install geopandas -qy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import geopandas as gpd
from shapely.geometry import Point, Polygon
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
In [2]:
import os
os.getcwd()
os.chdir('/home/jovyan/assignment-08-cross-validation-drewgobbi')
Today we'll be looking at 311 service requests for rodent inspection and abatement aggregated at the Census block level. The data set is already prepared for you and available in the same folder as this assignment. Census blocks are a good geographic level to analyze rodent infestations because they are drawn along natural and human-made boundaries, like rivers and roads, that rats tend not to cross.
We will look at the 'activity' variable, which indicates whether inspectors found rat burrows during an inspection (1) or not (0). Here we are looking only at inpsections in 2016. About 43 percent on inspections in 2016 led to inspectors finding and treating rat burrows, as you can see below.
In [3]:
data = pd.read_csv('rat_data_2016.csv')
In [4]:
data.columns
Out[4]:
In [5]:
data.describe().T
Out[5]:
Recall from last week that, when we do predictive analysis, we usually are not interested in the relationship between two different variables as we are when we do traditional hypothesis testing. Instead, we're interested in training a model that generates predictions that best fit our target population. Therefore, when we are doing any kind of validation, including cross-validation, it is important for us to choose the metric by which we will evaluate the performance of our models.
For this model, we will predict the locations of requests for rodent inspection and abatement in the District of Columbia. When we select a validation metric, it's important for us to think about what we want to optimize. For example, do we want to make sure that our top predictions accurately identify places with rodent infestations, so we don't send our inspectors on a wild goose chase? Then we may to look at the models precision, or what proportion of its positive predictions turn out to be positive. Or do we want to make sure we don't miss any infestations? If so, we may want to look at recall, or the proportion of positive cases that are correctly categorized by the model. If we care a lot about how the model ranks our observations, then we may want to look at the area under the ROC curve, or ROC-AUC, while if we care more about how well the model fits the data, or its "calibration," we may want to look at Brier score or logarithmic loss (log-loss).
In the case of rodent inspections, we most likely want to make sure that we send our inspectors to places where they are most likely to find rats and to avoid sending them on wild goose [rat] chases. Therefore, we will optimize for precision, which we will call from the metrics library in scikit-learn.
The metrics library in scikit-learn provides a number of different options. You should take some time to look at the different metrics that are available to you and consider which ones are most appropriate for your own research
In [6]:
from sklearn.metrics import precision_score
The next important decision we need to make when cross-validating our models is how we will define our "folds." Folds are the independent subsamples on which we train and test the data. Keep in mind that it is important that our folds are INDEPENDENT, which means we must guarantee that there's no overlap between our training and test set (i.e., no observation is in both the training and test set). Independence can also have other implications for how we slice the data, which we will discuss as we progress through this lesson.
One of the most common approaches to cross-validation is to make random splits in the data. This is often referred to as k-fold cross-validation, in which the only thing we define is the number of folds (k) that want to split our sample into. Here, I'll use the KFold function from scikit-learn's model_selection library. Let's begin by importing the library and then taking a look at how it splits our data.
In [7]:
from sklearn.model_selection import KFold
KFold divides our data into a pre-specified number of (approximately) equally-sized folds so that each observation is in the test set once. When we specify that shuffle=True, KFold first shuffles our data into a random order to ensure that the observations are randomly selected. By selecting a random_state, we can ensure that KFold selects observations the same way each time.
While there are other functions in the model_selection library that will do much of this work for us, KFold will allow us to look at what's going on in the background of our cross-validation process. Let's begin by just looking at how KFold splits our data. Here we split our data into 10 folds each with 10 percent of the data (.1).
In [8]:
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
print("TRAIN:", train_index, "TEST:", test_index)
You can see that ShuffleSplit has selected a random set of observations from the index of our data set for each fold of our cross-validation. Let's look at the size of our training and test set for each fold.
In [9]:
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
Now let's try using KFold to train and test our model on 10 different subsets of our data. Below we set our cross-validator as 'cv'. We then loop through the various splits in our data that cv creates and use it to make our training and test sets. We then use our training set to fit a Logistic Regression model and generate predictions from our test set, which we compare to the actual outcomes we observed.
In [10]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)
## Create for-loop
for train_index, test_index in cv.split(data):
## Define training and test sets
X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
y_train = data.loc[train_index]['activity']
X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
y_test = data.loc[test_index]['activity']
## Fit model
clf = LogisticRegression()
clf.fit(X_train, y_train)
## Generate predictions
predicted = clf.predict(X_test)
## Compare to actual outcomes and return precision
print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))
We can see that, for the most part, about 50 to 60 percent of the inspections our model predicts will lead our inspectors to rat burrows actually do. This is a modest improvement over our inspectors' current performance in the field. Based on these results, if we used our models to determine which locations our inspectors go to in the field, we'd probably see a 10 to 20 point increase in their likelihood of finding rat burrows.
Try running the k-fold cross-validation a few times with the same random state. Then try running it a few times with different random states. How do the results change?
In [11]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)
## Create for-loop
for train_index, test_index in cv.split(data):
## Define training and test sets
X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
y_train = data.loc[train_index]['activity']
X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
y_test = data.loc[test_index]['activity']
## Fit model
clf = LogisticRegression()
clf.fit(X_train, y_train)
## Generate predictions
predicted = clf.predict(X_test)
## Compare to actual outcomes and return precision
print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))
In [12]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=1)
## Create for-loop
for train_index, test_index in cv.split(data):
## Define training and test sets
X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
y_train = data.loc[train_index]['activity']
X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
y_test = data.loc[test_index]['activity']
## Fit model
clf = LogisticRegression()
clf.fit(X_train, y_train)
## Generate predictions
predicted = clf.predict(X_test)
## Compare to actual outcomes and return precision
print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))
In [13]:
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=17)
## Create for-loop
for train_index, test_index in cv.split(data):
## Define training and test sets
X_train = data.loc[train_index].drop(['activity', 'month', 'WARD'], axis=1)
y_train = data.loc[train_index]['activity']
X_test = data.loc[test_index].drop(['activity', 'month', 'WARD'], axis=1)
y_test = data.loc[test_index]['activity']
## Fit model
clf = LogisticRegression()
clf.fit(X_train, y_train)
## Generate predictions
predicted = clf.predict(X_test)
## Compare to actual outcomes and return precision
print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))
It's important to point out here that, because we have TIME SERIES data, the same Census blocks may be appearing in our training AND our test sets. This is a challenge to ensuring that our training and test samples are INDEPENDENT. While Rodent Control does not inspect the same blocks every month, some of the same blocks may be re-inspected from month to month depending on where 311 requests are coming from.
However, this also affords us an opportunity. More than likely, when we make predictions about which inspections will lead our inspectors to rat burrows, we are interested in predicting FUTURE inspections with observations from PAST inspections. In this case, cross-validating over time can be a very good way of looking at how well our models are performing.
Cross-validating over time requires more than just splitting by month. Rather, we will use observations from each month as a test set and train our models on all PRIOR months. Which we do below.
Let's begin by seeing what our cross-validation sets look like. Below, we loop through each of the sets to see which months end up in our training and test sets. You can see that as we move from month to month, we have more and more past observations in our training set.
In [14]:
months = np.sort(data.month.unique())
for month in range(2,13):
test = data[data.month==month]
train = data[(data.month < month)]
print('Test Month: '+str(test.month.unique()), 'Training Months: '+str(train.month.unique()))
In [15]:
months = np.sort(data.month.unique())
for month in range(2,13):
test = data[data.month==month]
train = data[(data.month < month)]
X_test = test.drop(['activity', 'month', 'WARD'], axis=1)
y_test = test['activity']
X_train = test.drop(['activity', 'month', 'WARD'], axis=1)
y_train = test['activity']
clf = LogisticRegression()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print('Precision for Month '+str(month)+': '+str(100*round(precision_score(y_test, predicted),3)))
Our model seems to be performing even better when we cross-validate over months, possibly because we're structuring the cross-validation such that inspections in some of the same blocks appear consistently over time.
Try re-creating this cross-validation, but with the training set restricted to only the 3 months prior to the test set. Now do the same with the last 1 and 2 months. Do the results change?
In [16]:
months = np.sort(data.month.unique())
for month in range(2,13):
test = data[data.month==month]
train = data[data.month>=month-3]
train = train.drop[train.month>=month]
print('Test Month: '+str(test.month.unique()), 'Training Months: '+str(train.month.unique()))
In [ ]:
In [ ]:
In [ ]:
We may still be concerned about the independence of our training and test sets. In particular, as I've pointed out, the same Census blocks may appear repeatedly in our data over time. In this case, it may be good to cross-validate geographically to make sure that our model is performing well in different parts of the city. In particular, we know that requests for rodent abatement (and rats themselves) are more common in some parts of the city than in others. In particular, rats are more common in the more densely-populated parts of downtown and less common in less densely-populated places like Wards 3, 7, and 8. Therefore, we may be interested in cross-validating by ward.
Again, this is as simple as looping through each of the 8 wards, holding out each ward as a test set and training the models on observations from the remaining wards.
In [ ]:
data.WARD.value_counts().sort_index()
In [75]:
for ward in np.sort(data.WARD.unique()):
test = data[data.WARD == ward]
train = data[data.WARD != ward]
X_test = test.drop(['activity', 'month', 'WARD'], axis=1)
y_test = test['activity']
X_train = test.drop(['activity', 'month', 'WARD'], axis=1)
y_train = test['activity']
clf = LogisticRegression()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print('Precision for Ward '+str(ward)+': '+str(100*round(precision_score(y_test, predicted),3)))
Here we see that the model performs very well predicting the outcomes of inspections in wards 1 through 4, but less well in wards 5 though 8. In wards 7 and 8 in particular, the model fails to predict any positive cases. This means that our model may be overfit to observations in Wards 1 through 6, and we may want to re-evaluate our approach.
Explore the data and our model and try to come up with some reasons that the model is performing poorly on Wards 7 and 8. Is there a way we can fix the model to perform better on those wards? How might we fix the model?
In [19]:
data.head().T
Out[19]:
In [24]:
data.describe().T
Out[24]:
In [41]:
data.bbl_restaurant.value_counts()
Out[41]:
In [42]:
data.groupby('WARD').bbl_restaurant.value_counts()
Out[42]:
In [65]:
data.groupby(data.activity==1).WARD.value_counts().sort_values(ascending = False)
Out[65]:
In [79]:
data.groupby('WARD').tot_pop.sum().sort_values(ascending = False)
Out[79]:
Looks like Ward 3, 7, 8 are about the same size. Model is MOST Accurate in Ward 3 and not accurate at all in Wards 7 and 8. Maybe our model is overfit to 3 -- what about these wards are different?
In [96]:
three = data[data.WARD==3]
seven = data[data.WARD==7]
eight = data[data.WARD==8]
three.activity.value_counts()
Out[96]:
In [97]:
seven.activity.value_counts()
Out[97]:
In [98]:
eight.activity.value_counts()
Out[98]:
In [108]:
data.groupby('WARD').activity.value_counts(sort=True)
Out[108]:
Ward's 5-8 have different active-not active ratios. They're most out of whack in 6-8, the least accurate of our predictions. Ratio of not active/to active is about 1 (give or take .25) for every value.
Things are about 3 times as inactive and 5-6 times as inactive in wards 6-8. Ward 3, the sample we could be overfitting to is about 1.5 times more inactive. A bit weird, but more reasonable. This could be a source of our issue?
Now try running some cross-validations with the data from your project. What are some different ways you might slice the data you're using for your project? Try them out here. This will be a good way to begin making progress toward your final submission.
PLEASE REMEMBER TO SUBMIT THIS HOMEWORK BY CLASS TIME ON THURSDAY.
In [124]:
data.describe().T
Out[124]:
In [132]:
df1 = pd.read_csv('https://opendata.arcgis.com/datasets/82ab09c9541b4eb8ba4b537e131998ce_22.csv')
df2 = pd.read_csv('https://opendata.arcgis.com/datasets/f2e1c2ef9eb44f2899f4a310a80ecec9_2.csv')
In [131]:
Out[131]:
In [133]:
df2.describe().T
Out[133]:
In [162]:
df1 = pd.read_csv('https://opendata.arcgis.com/datasets/82ab09c9541b4eb8ba4b537e131998ce_22.csv')
df2 = pd.read_csv('https://opendata.arcgis.com/datasets/f2e1c2ef9eb44f2899f4a310a80ecec9_2.csv')
DF = df1.merge(df2, on ='X', how = 'outer')
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
print("TRAIN:", train_index, "TEST:", test_index)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)
## Create for-loop
for train_index, test_index in cv.split(data):
## Define training and test sets
X_train = DF.loc[train_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
y_train = DF.loc[train_index]['BBL_LICENSE_FACT_ID']
X_test = DF.loc[test_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
y_test = DF.loc[test_index]['BBL_LICENSE_FACT_ID']
## Fit model
clf = LogisticRegression()
clf.fit(X_train, y_train)
## Generate predictions
predicted = clf.predict(X_test)
## Compare to actual outcomes and return precision
print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))
In [164]:
DF
Out[164]:
In [145]:
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
print("TRAIN:", train_index, "TEST:", test_index)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(data):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
## Define function
cv = KFold(n_splits=10, shuffle=True, random_state=0)
## Create for-loop
for train_index, test_index in cv.split(data):
## Define training and test sets
X_train = DF.loc[train_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
y_train = DF.loc[train_index]['BBL_LICENSE_FACT_ID']
X_test = DF.loc[test_index].drop(['BBL_LICENSE_FACT_ID'], axis=1)
y_test = DF.loc[test_index]['BBL_LICENSE_FACT_ID']
## Fit model
clf = LogisticRegression()
clf.fit(X_train, y_train)
## Generate predictions
predicted = clf.predict(X_test)
## Compare to actual outcomes and return precision
print('Precision: '+str(100 * round(precision_score(y_test, predicted),3)))
In [ ]: