Note about variable scope in Jupyter Notebooks (.ipynb files)

When a variable is declared in a cell and that cell is run, the variable enters the global namespace for that notebook.
If you want to restrict variables to a local scope, wrap them in a function:


In [18]:
def foo():
    y = 'cuz_i'
    return y

foo()


Out[18]:
'cuz_i'

In [19]:
def bar():
    y = 'said_so'
    return y

bar()


Out[19]:
'said_so'

If you aren't careful about variable scope, this global behavior can lead to confusing bugs, so use caution.
Now on to:

The scoop on missing values.

There are many ways data can end up with missing values.

  • A 2-bedroom house doesn't have a value for the size of a third bedroom.
  • Someone being surveyed may choose not to share their income.

Python libraries represent missing numbers as nan, which is short for "not a number".
You can detect which cells have missing values, and then count how many there are in each column with the command:

print(data.isnull().sum())

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.
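
For instance, here is a minimal sketch of what that error looks like; the tiny frame below is made up for illustration, and the exact message depends on your scikit-learn version:

import pandas as pd
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({'rooms': [2.0, 3.0, None]})
y = pd.Series([200000, 320000, 250000])
LinearRegression().fit(X, y)  # raises ValueError because X contains NaN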
Let's figure out how to deal with them.

1. You can drop columns with missing values:

data_without_missing_values = original_data.dropna(axis=1)

If you want to drop the same columns from the DataFrames in both your training dataset and test dataset:

cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

This method discards all information in the entire column, so it can be useful when most values in a column are missing.
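
For example, here is a rough sketch of dropping only the columns that are mostly empty; the 50% cutoff is an arbitrary choice made up for this sketch, not part of the approach above:

# Keep a column only if at least half of its values are present:
mostly_complete_data = original_data.dropna(axis=1, thresh=len(original_data) // 2)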

2. You can impute missing values:

Imputation replaces the missing value with some number (the mean, for example), which usually gives more accurate models than dropping the column entirely.

from sklearn.preprocessing import Imputer

my_imputer = Imputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
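
Note that Imputer was removed from scikit-learn in later releases; on a recent version, the equivalent (a minimal sketch) uses SimpleImputer from sklearn.impute:

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()  # imputes the column mean by default
data_with_imputed_values = my_imputer.fit_transform(original_data)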

Imputation can also be included in a scikit-learn Pipeline, which simplifies model building, validation, and deployment.
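
Here is a minimal sketch of that idea; the toy predictors and target are made up for illustration, and the Imputer/RandomForestRegressor pairing mirrors what this tutorial uses elsewhere (swap in SimpleImputer on newer scikit-learn releases):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer

# Toy data with a missing value:
X = pd.DataFrame({'rooms': [2.0, 3.0, None, 4.0], 'area': [80.0, 120.0, 95.0, 150.0]})
y = pd.Series([200000, 320000, 250000, 400000])

# The imputer runs inside the pipeline, so it is fit only on the data passed to fit().
my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
my_pipeline.fit(X, y)
predictions = my_pipeline.predict(X)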

3. You can extend imputation to consider which values were originally missing:

Imputation is the standard approach, and it usually works well.
However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset).
Or rows with missing values may be unique in some other way.
In that case, your model would make better predictions by considering which values were originally missing.
Here's how it might look:

# Make a copy to keep the original data intact:
new_data = original_data.copy()

# Make new columns indicating what will be imputed:
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + ' was missing'] = new_data[col].isnull()

# Impute:
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)

This approach may or may not improve the results compared to simply imputing values.

An example comparing these approaches using the Melbourne Housing data.

We will predict housing prices from the Melbourne Housing data.


In [20]:
import pandas as pd

mb_data = pd.read_csv('input/melbourne_data.csv')

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

mb_target = mb_data.Price
mb_predictors = mb_data.drop(['Price'], axis=1)

# In order to simplify this example, only numeric predictors are used.
mb_numeric_predictors = mb_predictors.select_dtypes(exclude=['object'])

Create a function to measure how well each approach performs:

We divide our data into training and test sets, and define a function score_dataset(X_train, X_test, y_train, y_test) to compare the quality of different approaches to missing values.
This function reports the out-of-sample MAE score from a RandomForest.


In [22]:
X_train, X_test, y_train, y_test = train_test_split(mb_numeric_predictors, mb_target, train_size=0.7,
                                                   test_size=0.3, random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mae(y_test, preds)

Dropping columns with missing values:


In [23]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error after dropping columns with missing values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))


Mean Absolute Error after dropping columns with missing values:
349491.40597934404

Get model score from imputation:


In [24]:
from sklearn.preprocessing import Imputer

my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error after imputing misssing values:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))


Mean Absolute Error after imputing missing values:
204969.41964123936

Get the score from imputation with extra columns showing what was imputed:


In [25]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# https://www.python.org/dev/peps/pep-0289/
cols_with_missing = (col for col in X_train.columns if X_train[col].isnull().any())

for col in cols_with_missing:
    imputed_X_train_plus[col + ' was missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + ' was missing'] = imputed_X_test_plus[col].isnull()
    
# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error while tracking imputed values:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))


Mean Absolute Error while tracking imputed values:
203061.2769885849

The difference between plain imputation and imputation with the extension is small here, and both do far better than dropping entire columns.

Categorical data is data that takes only a limited number of values.
For example, if people responded to a survey about which brand of car they owned, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None).
Responses fall into a fixed set of categories.
You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first.
Here we'll show the most popular method for encoding categorical variables.

One hot encoding is the most widespread approach, and it works very well unless your categorical variables take on a large number of values.
You generally won't use it for variables taking more than about a dozen different values.
One hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data.

Suppose the values in the original data are Red, Yellow, and Green.
We create a separate column for each possible value.
Wherever the original value was Red, a 1 is entered into the Red column, and so forth.
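
A quick sketch of this with pandas (the toy Color column is made up for illustration):

import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Yellow', 'Green', 'Red']})
pd.get_dummies(colors)  # produces the binary columns Color_Green, Color_Red, Color_Yellow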

Example

Let's begin at the point where the train_predictors and test_predictors DataFrames are set up.
This data contains the housing characteristics.
You will use them to predict home prices, which are stored in a pandas Series called target.


In [26]:
import pandas as pd

train_data = pd.read_csv('input/train.csv')
test_data = pd.read_csv('input/test.csv')

# Disregard the houses that are missing the target value:
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Keep this simple (and inaccurate, remember) and just drop columns with missing values.
# Try experimenting with different and better ways to go about this.
cols_with_missing = [col for col in train_data.columns if train_data[col].isnull().any()]

# Let's get some predictors chosen now:
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# Cardinality == number of unique values in a column.
# This is a convenient (and arbitrary) way of selecting categorical columns.

low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]

numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]

my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

Pandas assigns a data type (dtype) to each Series (column).
Let's take a look at a random sample of dtypes from the predictor data:


In [27]:
train_predictors.dtypes.sample(10)


Out[27]:
YearBuilt        int64
BsmtHalfBath     int64
MSSubClass       int64
ScreenPorch      int64
MSZoning        object
GarageCars       int64
TotRmsAbvGrd     int64
CentralAir      object
FullBath         int64
OverallCond      int64
dtype: object

The object dtype usually indicates a column has text.
It's common to one-hot encode these "object" columns, since they can't be plugged directly into most models.
With pandas you can use the get_dummies() function for one-hot encoding.


In [28]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_training_predictors[:5]


Out[28]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 8450 7 5 2003 2003 706 0 150 856 ... 0 0 0 1 0 0 0 0 1 0
1 20 9600 6 8 1976 1976 978 0 284 1262 ... 0 0 0 1 0 0 0 0 1 0
2 60 11250 7 5 2001 2002 486 0 434 920 ... 0 0 0 1 0 0 0 0 1 0
3 70 9550 7 5 1915 1970 216 0 540 756 ... 0 0 0 1 1 0 0 0 0 0
4 60 14260 8 5 2000 2000 655 0 490 1145 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 159 columns

Alternatively, you could have dropped the categoricals.
To see how the approaches compare, we can calculate the mean absolute error of models that are built with two alternative sets of predictors.

  1. One-hot encoded categoricals as well as numeric predictors.
  2. Numerical predictors, where we drop categoricals.

One-hot encoding usually helps, but it varies on a case-by-case basis.
In this case, there appears to be little meaningful benefit from using one-hot encoded variables.


In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # Convention is to return a positive MAE score, so multiply by -1.
    return -1 * cross_val_score(RandomForestRegressor(50), X, y,
                                scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])
mae_without_categoricals = get_mae(predictors_without_categoricals, target)
mae_without_categoricals


Out[29]:
18243.417545370303

In [30]:
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)
mae_one_hot_encoded


Out[30]:
17882.902608000048

Applying to Multiple Files

So far, you've one-hot-encoded your training data.
What do you do when you have multiple files (e.g. a test dataset, or some other data that you'd like to make predictions for)?
Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be useless, maybe even misleading.
This could happen if a categorical had a different number of values in the training data vs the test data.
Ensure the test data is encoded in the same manner as the training data with the align command:


In [31]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors, 
                                                                    join='left', axis=1)

In [32]:
final_train[:5]


Out[32]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 8450 7 5 2003 2003 706 0 150 856 ... 0 0 0 1 0 0 0 0 1 0
1 20 9600 6 8 1976 1976 978 0 284 1262 ... 0 0 0 1 0 0 0 0 1 0
2 60 11250 7 5 2001 2002 486 0 434 920 ... 0 0 0 1 0 0 0 0 1 0
3 70 9550 7 5 1915 1970 216 0 540 756 ... 0 0 0 1 1 0 0 0 0 0
4 60 14260 8 5 2000 2000 655 0 490 1145 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 159 columns


In [33]:
final_test[:5]


Out[33]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 20 11622 5 6 1961 1961 468.0 144.0 270.0 882.0 ... 0 0 0 1 0 0 0 0 1 0
1 20 14267 6 6 1958 1958 923.0 0.0 406.0 1329.0 ... 0 0 0 1 0 0 0 0 1 0
2 60 13830 5 5 1997 1998 791.0 0.0 137.0 928.0 ... 0 0 0 1 0 0 0 0 1 0
3 60 9978 6 6 1998 1998 602.0 0.0 324.0 926.0 ... 0 0 0 1 0 0 0 0 1 0
4 120 5005 8 5 1992 1992 263.0 0.0 1017.0 1280.0 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 159 columns

The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset.)
The argument join='left' specifies that we will do the equivalent of SQL's left join.
That means, if there are ever columns that show up in one dataset and not the other, we will keep the columns from our training data.
The argument join='inner' would do what SQL databases call an inner join, keeping only the columns showing up in both datasets.
That can also be a sensible choice.
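
For instance, here is a minimal sketch of the inner-join variant, reusing the encoded DataFrames built above:

final_train_inner, final_test_inner = one_hot_encoded_training_predictors.align(
    one_hot_encoded_test_predictors, join='inner', axis=1)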

Conclusion

The world is filled with categorical data.
You will be a much more effective data scientist if you know how to use this data.
Here are resources that will be useful as you start doing more sophisticated work with categorical data.

  • Pipelines: Deploying models into production-ready systems is a topic unto itself. While one-hot encoding is still a great approach, your code will need to be built in an especially robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding, and this can be added to a Pipeline (see the sketch after this list). Unfortunately, it doesn't handle text or object values, which is a common use case.
  • Applications To Text for Deep Learning: Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.
  • Categoricals with Many Values: Scikit-learn's FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.
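
Regarding the Pipelines bullet above: in newer scikit-learn releases (0.22 and later), OneHotEncoder does accept string/object columns directly and can be combined with a ColumnTransformer. The following is a minimal sketch under that assumption, reusing low_cardinality_cols, train_predictors, and target from earlier in this notebook; it is not the exact approach used above.

from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the low-cardinality object columns; pass the numeric columns through untouched.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), low_cardinality_cols),
    remainder='passthrough')

# Bundling preprocessing with the model keeps the encoding consistent across training, validation, and prediction.
my_pipeline = make_pipeline(preprocessor, RandomForestRegressor())
my_pipeline.fit(train_predictors, target)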