End-to-End Data Science in Python

Introduction

This is the workbook for the "End-to-End Data Analysis in Python" workshop at the Open Data Science Conference 2015, in beautiful San Francisco. This notebook contains starter code only; the goal is that we will fill in the gaps together as we progress through the workshop. If, however, you're doing this asynchronously or you get stuck, you can reference the solutions workbook.

The objective is to complete the "Pump it Up: Mining the Water Table" challenge on drivendata.org: predicting which African wells are non-functional or in need of repair. Per the rules of the competition, you should register for an account with drivendata.org, at which point you can download the training set values and labels; we will be working with those datasets during this workshop. Download those files to the directory in which this notebook lives, and name them well_data.csv and well_labels.csv (to be consistent with the file names used in the code below). You are also encouraged to continue developing your solution after this workshop, and/or to enter your solution in the competition on the drivendata website!

Code requirements

Here's the environment you'll need to work with this code base:

  • python 3 (2.x may work with minor changes, but no guarantees)
  • pandas
  • scikit-learn
  • numpy

First Draft of an Analysis


In [1]:
import pandas as pd
  
features_df = pd.read_csv("well_data.csv", index_col=0)
labels_df   = pd.read_csv("well_labels.csv", index_col=0)
print( labels_df.head() )

One nice feature of IPython notebooks is that it's easy to make small changes to code and re-execute quickly to see how things change. For example, printing the first 5 lines of the labels dataframe (the default) isn't really ideal here, since there's a label ("functional needs repair") that doesn't appear in the first five lines. Change labels_df.head() to labels_df.head(20) and press shift-enter to rerun the code. See the difference?

Now take a quick look at the features, again by calling .head() (set up for you in the code box below, or add your own code to the code box above). You can print as many or as few rows as you like. Take a quick look at the data--approximately how many features are there? Are they all numeric, or will you have to do some work to transform non-numeric features into numbers?


In [1]:
print( features_df.head() )

Transforming string labels into integers

The machine learning algorithms downstream are not going to handle it well if the class labels used for training are strings; instead, we'll want to use integers. The mapping that we'll use is that "non functional" will be transformed to 0, "functional needs repair" will be 1, and "functional" becomes 2.

There are a number of ways to do this; the framework below uses applymap() in pandas. The pandas documentation for applymap() is worth a look. In the code below, you should fill in the function body for label_map(y) so that if y is "functional", label_map returns 2; if y is "functional needs repair", it returns 1; and if y is "non functional", it returns 0. There's a print statement there to help you confirm that the label transformation is working properly.

As an aside, you could also use apply() here if you like. The difference between apply() and applymap() is that applymap() operates on a whole dataframe while apply() operates on a series (or you can think of it as operating on one column of your dataframe). Since labels_df only has one column (aside from the index column), either one will work here.


In [2]:
def label_map(y):
    ### your code goes here
    pass

labels_df = labels_df.applymap(label_map)

print(labels_df.head())
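
If you get stuck, one possible body for label_map(), following the mapping described above, looks like this:

def label_map(y):
    ### encode the three status strings as integers per the mapping above
    if y == "functional":
        return 2
    elif y == "functional needs repair":
        return 1
    else:
        return 0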

Transforming string features into integers

Now that the labels are ready, we'll turn our attention to the features. Many of the features are categorical, where a feature can take on one of a few discrete values, which are not ordered. Fill in the function body of transform_feature( df, column_name ) below so that it takes our features_df and the name of a column in that dataframe, and returns the same dataframe, but with the indicated feature encoded with integers rather than strings.

We've provided code to wrap your transformer function in a loop iterating through all the columns that should be transformed.

Last, add a line of code at the bottom of the block below that removes the date_recorded column from features_df. Time-series information like dates and times needs special treatment, which we won't be going into today.


In [ ]:
def transform_feature( df, column_name ):
    ### your code goes here
    return df

### list of column names indicating which columns to transform; 
### this is just a start!  Use some of the print( features_df.head() )
### output upstream to help you decide which columns get the
### transformation
names_of_columns_to_transform = ["funder", "installer", "wpt_name"]
for column in names_of_columns_to_transform:
    features_df = transform_feature( features_df, column )
    
### remove the "date_recorded" column--we're not going to make use
### of time-series data today
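
One way to fill in the transformer is sketched below; pandas offers several equivalent routes (pandas.factorize, for example), so treat this as just one option:

def transform_feature( df, column_name ):
    ### build a dictionary mapping each unique string in the column to an integer code
    unique_values = set( df[column_name].tolist() )
    transformer_dict = { value: index for index, value in enumerate(unique_values) }
    df[column_name] = df[column_name].map( transformer_dict )
    return df

### one way to remove the time-series column
features_df = features_df.drop( "date_recorded", axis=1 )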

Ok, a couple last steps to get everything ready for sklearn. The features and labels are taken out of their dataframes and put into a numpy.ndarray and list, respectively.


In [3]:
X = features_df.values
y = labels_df["status_group"].tolist()



Predicting well failures with logistic regression

The cheapest and easiest way to train on one portion of your dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score(). This splits your data into 3 equal portions, trains on two of them, and tests on the third, repeating the process 3 times with a different portion held out each time. That's why 3 numbers get printed in the code block below.

You don't have to add anything to the code block, it's ready to go already. However, use it for reference in the next part of the tutorial, where you will be looking at other sklearn algorithms.


In [6]:
import sklearn.linear_model
import sklearn.cross_validation

clf = sklearn.linear_model.LogisticRegression()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )



Comparing logistic regression to tree-based methods

We have a baseline logistic regression model for well failures. Let's compare to a couple of other classifiers, a decision tree classifier and a random forest classifier, to see which one seems to do the best.

Code this up on your own. You can use the code in the box above as a kind of template, and just drop in the new classifiers. The sklearn documentation for DecisionTreeClassifier and RandomForestClassifier might also be helpful.

We will talk about all three of these models more in the next part of the tutorial.


In [ ]:
### your code here
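
If you'd like a reference point, a sketch along the lines of the logistic regression cell above might look like this (both classifiers are left at their default parameters):

import sklearn.tree
import sklearn.ensemble
import sklearn.cross_validation

clf = sklearn.tree.DecisionTreeClassifier()
print( sklearn.cross_validation.cross_val_score( clf, X, y ) )

clf = sklearn.ensemble.RandomForestClassifier()
print( sklearn.cross_validation.cross_val_score( clf, X, y ) )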

Congratulations! You have a working data science setup, in which you have:

  • read in data
  • transformed features and labels to make the data amenable to machine learning
  • made a train/test split (this was done implicitly when you called cross_val_score)
  • evaluated several models for identifying wells that are failed or in danger of failing

Paying down technical debt and tuning the models

We got things running really fast, which is great, but at the cost of being a little quick-and-dirty about some details. First, we got the features encoded as integers, but they really should be dummy variables. Second, it's worth going through the models a little more thoughtfully, to try to understand their performance and if there's any more juice we can get out of them.

One-hot encoding to make dummy variables

A problem with representing categorical variables as integers is that integers are ordered, while categories are not. The standard way to deal with this is to use dummy variables; one-hot encoding is a very common way of dummying. Each possible category becomes a new boolean feature. For example, if our dataframe looked like this:

index  country
1      "United States"
2      "Mexico"
3      "Mexico"
4      "Canada"
5      "United States"
6      "Canada"

then after dummying it will look something like this:

index  country_UnitedStates  country_Mexico  country_Canada
1      1                     0               0
2      0                     1               0
3      0                     1               0
4      0                     0               1
5      1                     0               0
6      0                     0               1

Hopefully the origin of the name is clear--each variable is now encoded over several boolean columns, one of which is true (hot) and the others are false.

Now we'll write a hot-encoder function that takes the data frame and the title of a column, and returns the same data frame but one-hot encoding performed on the indicated feature.

Protip: sklearn has a one-hot encoder function available that will be your friend here.


In [ ]:
def hot_encoder(df, column_name):
    ### your code goes here
    return df
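
The protip above points at sklearn's OneHotEncoder; a sketch using pandas.get_dummies instead, which keeps everything inside the dataframe, could look like this:

import pandas as pd

def hot_encoder( df, column_name ):
    ### expand the column into one boolean column per category,
    ### then drop the original integer-coded column
    dummies = pd.get_dummies( df[column_name], prefix=column_name )
    df = df.drop( column_name, axis=1 ).join( dummies )
    return df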

Now we'll take the names_of_columns_to_transform list that you populated above with categorical variables, and use it to loop through the columns that will be one-hot encoded.

One note before you code that up: one-hot encoding comes with the baggage that it makes your dataset bigger--sometimes a lot bigger. In the countries example above, one column that encoded the country has now been expanded out to three columns. You can imagine that this can sometimes get really, really big (imagine a column encoding all the counties in the United States, for example).

There are some columns in this example that will really blow up the dataset, so we'll remove them before proceeding with the one-hot encoding.


In [ ]:
features_df.drop( "funder", axis=1, inplace=True )
features_df.drop( "installer", axis=1, inplace=True )
features_df.drop( "wpt_name", axis=1, inplace=True )
features_df.drop( "subvillage", axis=1, inplace=True )
features_df.drop( "ward", axis=1, inplace=True )

for feature in names_of_columns_to_transform:
    if feature in features_df.columns:  ### skip the high-cardinality columns we just dropped
        features_df = hot_encoder( features_df, feature )

Now that the features are a little fixed up, I'd invite you to rerun the models, and see if the cross_val_score goes up as a result. It is also a great chance to take some of the theory discussion from the workshop and play around with the parameters of your models, and see if you can increase their scores that way. There's a blank code box below where you can play around.


In [4]:
### your code here
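
As a starting point, you might re-extract X from the dummied features and rerun the random forest, something like the sketch below (the parameter values are just guesses to experiment with):

import sklearn.ensemble
import sklearn.cross_validation

X = features_df.values
y = labels_df["status_group"].tolist()

clf = sklearn.ensemble.RandomForestClassifier( n_estimators=100, min_samples_split=10 )
print( sklearn.cross_validation.cross_val_score( clf, X, y ) )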


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-4fa70eb56ab4> in <module>()
----> 1 X = features_df.as_matrix()
      2 y = labels_df["status_group"].tolist()
      3 
      4 clf = sklearn.ensemble.RandomForestClassifier()
      5 score = sklearn.cross_validation.cross_val_score( clf, X, y )

NameError: name 'features_df' is not defined

End-to-end workflows using Pipeline and GridSearchCV

So far we have built a nice analysis by assembling a few ideas into a script-like workflow. A few spots remain where we can tighten things up, though:

  • the best model, the random forest, has a lot of parameters that we'd have to work through if we really wanted to tune it
  • after dummying, we have lots of features, probably only a subset of which are really offering any discriminatory power (this is a version of the bias-variance tradeoff)
  • maybe there's a way to make the code more streamlined (hint: there is)

We will solve all these with two related and lovely tools in sklearn: Pipeline and GridSearchCV.

Pipeline in sklearn is a tool for chaining together multiple pieces of a workflow into a single coherent analysis. In our example, we will chain together a tool for feature selection, which addresses the second point, and feed the selected feature set into the random forest model, all in a few lines of code (which addresses the third point).

To get to the first point, about finding the best parameters--that's where the magic of GridSearchCV comes in. But first we need to get the feature selector and pipeline up and running, so let's do that now.

In sklearn.feature_selection there is a useful tool, SelectKBest (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), that you should use. By default, this will select the 10 best features; that seems like it might be too few features to do well on this problem, so change the number of features to 100.


In [5]:
import sklearn.feature_selection

### your code goes here
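
The cell might be filled in like this, using the k=100 suggested above:

select = sklearn.feature_selection.SelectKBest( k=100 )
selected_X = select.fit_transform( X, y )
print( selected_X.shape )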



Pipeline

After selecting the 100 best features, the natural next step would be to run our random forest again to see if it does a little better with fewer features. So we would have SelectKBest doing selection, with the output of that process going straight into a classifier. A Pipeline packages the transformation step of SelectKBest with the estimation step of RandomForestClassifier into a coherent workflow.

Why might you want to use Pipeline instead of keeping the steps separate?

  • makes code more readable
  • don't have to worry about keeping track of data during intermediate steps, for example between transforming and estimating
  • makes it trivial to change the ordering of the pipeline pieces, or to swap pieces in and out
  • allows you to do GridSearchCV on your workflow

This last point is, in my opinion, the most important. We will get to it very soon, but first let's get a pipeline up and running that does SelectKBest followed by RandomForestClassifier.

In the code box below, I've also set up a slightly better training/testing structure, where I am explicitly splitting the data into training and testing sets that we'll use from here on. The training/testing split before was handled automatically in cross_val_score, but we'll be using a different evaluation metric from here forward, the classification report, which requires us to handle the train/test split ourselves.


In [6]:
import sklearn.pipeline

select = ### fill this in--SelectKBest, k=100
clf = ### fill this in--RandomForestClassifier

pipeline = ### fill this in, using sklearn docs as a guide

X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=0.33, random_state=42)

### fit your pipeline on X_train and y_train
### call pipeline.predict() on your X_test data to make a set of test predictions
### test your predictions using sklearn.metrics.classification_report()
### and print the report
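
A sketch of how this cell might be completed is below; the step names "select" and "classify" are arbitrary labels for the pipeline stages:

import sklearn.pipeline
import sklearn.ensemble
import sklearn.feature_selection
import sklearn.cross_validation
import sklearn.metrics

select = sklearn.feature_selection.SelectKBest( k=100 )
clf = sklearn.ensemble.RandomForestClassifier()

pipeline = sklearn.pipeline.Pipeline( [("select", select), ("classify", clf)] )

X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split( X, y, test_size=0.33, random_state=42 )

### fit on the training split, predict on the held-out split, and report
pipeline.fit( X_train, y_train )
predictions = pipeline.predict( X_test )
print( sklearn.metrics.classification_report( y_test, predictions ) )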


  File "<ipython-input-6-eeda30413ef3>", line 3
    select = ### fill this in--SelectKBest, k=100
                                                 ^
SyntaxError: invalid syntax

Reading the classification report

A brief aside--we've switched from cross_val_score to classification_report for evaluation, mostly to show you two different ways of evaluating a model. The classification report has the advantage of giving you a lot more information; if (for example) one class is more important to get right than the others (say you're trying to zero in on non-functional wells, so finding those correctly is more important than getting the functional wells right), the per-class precision and recall in the report let you see exactly how you're doing on that class.

For more information, the sklearn docs on classification_report are, like all the sklearn docs, incredibly helpful. For interpreting the various metrics, this page may also help.

GridSearchCV

We're in the home stretch now. When we decided to select the 100 best features, setting that number to 100 was kind of a hand-wavey decision. Similarly, the RandomForestClassifier that we're using right now has all its parameters set to their default values, which might not be optimal.

So, a straightforward thing to do now is to try different values of k and any RandomForestClassifier parameters we want to tune (for the sake of concreteness, let's play with n_estimators and min_samples_split). Trying lots of values for each of these free parameters is tedious, and there can sometimes be interactions between the choices you make in one step and the optimal value for a downstream step. In other words, to avoid local optima, you should try all the combinations of parameters, and not just vary them independently. So if you want to try 5 different values each for k, n_estimators and min_samples_split, that means 5 x 5 x 5 = 125 different combinations to try. Not something you want to do by hand.

GridSearchCV allows you to construct a grid of all the combinations of parameters, tries each combination, and then reports back the best combination/model.


In [ ]:
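
If you get stuck, here is one possible sketch. The specific parameter grids are just illustrative values to search over, and sklearn.grid_search is the module name in the sklearn versions current at the time of this workshop (newer releases moved GridSearchCV to sklearn.model_selection):

import sklearn.grid_search
import sklearn.metrics

### parameter names take the form "<step name>__<parameter>", matching the pipeline steps above
parameters = {
    "select__k": [50, 100, 200],
    "classify__n_estimators": [10, 50, 100],
    "classify__min_samples_split": [2, 10, 20],
}

grid = sklearn.grid_search.GridSearchCV( pipeline, parameters )
grid.fit( X_train, y_train )

print( grid.best_params_ )
print( sklearn.metrics.classification_report( y_test, grid.predict( X_test ) ) )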