Data Science is Software


Developer #lifehacks for the Jupyter Data Scientist

Section 3: Refactoring for reusability


In [ ]:
%matplotlib inline
from __future__ import print_function

import os

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

PROJ_ROOT = os.path.join(os.pardir, os.pardir)

Use debugging tools throughout!

Don't forget all the fun debugging tools we covered while you work on these exercises.

  • %debug
  • %pdb
  • import q;q.d()
  • And (if necessary) %prun

Exercise 1

You'll notice that our dataset actually has two different files, pumps_train_values.csv and pumps_train_labels.csv. We want to load both of these together in a single DataFrame for our exploratory analysis. Create a function that:

  • Reads both of the csvs
  • uses the id column as the index
  • parses dates of the date_recorded columns
  • joins the labels and the training set on the id
  • returns the complete dataframe

In [ ]:
def load_pumps_data(values_path, labels_path):
    # YOUR CODE HERE
    pass
    
    
values = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_values.csv")
labels = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_labels.csv")

df = load_pumps_data(values, labels)
assert df.shape == (59400, 40)

In [ ]:
#SOLUTION
def load_pumps_data(values_path, labels_path):

    train = pd.read_csv(values_path, index_col='id', parse_dates=["date_recorded"])
    labels = pd.read_csv(labels_path, index_col='id')

    return train.join(labels)

values = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_values.csv")
labels = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_labels.csv")

df = load_pumps_data(values, labels)

assert df.shape == (59400, 40)

Exercise 2

Now that we've loaded our data, we want to do some pre-processing before we model. From inspection of the data, we've noticed that there are some numeric values that are probably not valid that we want to replace.

  • Select the relevant columns for modeling. For the purposes of this exercise, we'll select:

     useful_columns = ['amount_tsh',
                   'gps_height',
                   'longitude',
                   'latitude',
                   'region',
                   'population',
                   'construction_year',
                   'extraction_type_class',
                   'management_group',
                   'quality_group',
                   'source_type',
                   'waterpoint_type',
                   'status_group']
  • Replace longitude, and population where it is 0 with mean for that region.

    zero_is_bad_value = ['longitude', 'population']
  • Replace the latitude where it is -2E-8 (a different bad value) with the mean for that region.

    other_bad_value = ['latitude']
  • Replace construction_year less than 1000 with the mean construction year.

  • Convert object type (i.e., string) variables to categoricals.
  • Convert the label column into a categorical variable

A skeleton for this work is below where clean_raw_data will call replace_value_with_grouped_mean internally.

Copy and Paste the skeleton below into a Python file called preprocess.py in src/features/. Import and autoload the methods from that file to run tests on your changes in this notebook.


In [ ]:
def clean_raw_data(df):
    """ Takes a dataframe and performs four steps:
            - Selects columns for modeling
            - For numeric variables, replaces 0 values with mean for that region
            - Fills invalid construction_year values with the mean construction_year
            - Converts strings to categorical variables
            
        :param df: A raw dataframe that has been read into pandas
        :returns: A dataframe with the preprocessing performed.
    """
    pass
    
def replace_value_with_grouped_mean(df, value, column, to_groupby):
    """ For a given numeric value (e.g., 0) in a particular column, take the
        mean of column (excluding value) grouped by to_groupby and return that
        column with the value replaced by that mean.

        :param df: The dataframe to operate on.
        :param value: The value in column that should be replaced.
        :param column: The column in which replacements need to be made.
        :param to_groupby: Groupby this variable and take the mean of column.
                           Replace value with the group's mean.
        :returns: The data frame with the invalid values replaced
    """
    pass

In [ ]:
#SOLUTION
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(PROJ_ROOT, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport features.preprocess_solution
from features.preprocess_solution import clean_raw_data

In [ ]:
cleaned_df = clean_raw_data(df)

# verify construction year
assert (cleaned_df.construction_year > 1000).all()

# verify filled in other values
for numeric_col in ["population", "longitude", "latitude"]:
    assert (cleaned_df[numeric_col] != 0).all()
    
# verify the types are in the expected types
assert (cleaned_df.dtypes
                  .astype(str)
                  .isin(["int64", "float64", "category"])).all()

# check some actual values
assert cleaned_df.latitude.mean() == -5.970642969008563
assert cleaned_df.longitude.mean() == 35.14119354200863
assert cleaned_df.population.mean() == 277.3070009774711

Exercise 3

Now that we've got a feature matrix, let's train a model! Add a function as defined below to the src/model/train_model.py

The function should use sklearn.linear_model.LogisticRegression to train a logistic regression model. In a dataframe with categorical variables pd.get_dummies will do encoding that can be passed to sklearn.

The LogisticRegression class in sklearn handles muticlass models automatically, so no need to use get_dummies on status_group.

Finally, this method should return a GridSearchCV object that has been run with the following parameters for a logistic regression model:

params = {'C': [0.1, 1, 10]}

In [ ]:
def logistic(df):
    """ Trains a multinomial logistic regression model to predict the
        status of a water pump given characteristics about the pump.
    
        :param df: The dataframe with the features and the label.
        :returns: A trained GridSearchCV classifier
    """
    pass

In [ ]:
#SOLUTION

#import my method from the source code
%aimport model.train_model_solution
from model.train_model_solution import logistic

In [ ]:
%%time
clf = logistic(cleaned_df)

assert clf.best_score_ > 0.5

In [ ]:
# Just for fun, let's profile the whole stack and see what's slowest!
%prun logistic(clean_raw_data(load_pumps_data(values, labels)))