Note about variable scope in Jupyter Notebooks (.ipynb files)

When a variable is declared in a cell and that cell is run, the variable enters the global namespace for that notebook.
If you want to restrict variables to a local scope, wrap them in a function:


In [18]:
def foo():
    y = 'cuz_i'
    return y

foo()


Out[18]:
'cuz_i'

In [19]:
def bar():
    y = 'said_so'
    return y

bar()


Out[19]:
'said_so'

If you aren't careful about variable scope, this global behavior can lead to confusing bugs, so use caution.
Now on to:

The scoop on missing values.

There are many ways data can end up with missing values.

  • A 2-bedroom house doesn't have a value for the size of a third bedroom.
  • Someone being surveyed may choose not to share their income.

Python libraries represent missing numbers as nan, which is short for "not a number".
You can detect which cells have missing values, and then count how many there are in each column with the command:

print(data.isnull().sum())

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.
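
For instance, here is a minimal sketch of what that error looks like; the tiny frame below is made up for illustration, and the exact message depends on your scikit-learn version:

import pandas as pd
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({'rooms': [2.0, 3.0, None]})
y = pd.Series([200000, 320000, 250000])
LinearRegression().fit(X, y)  # raises ValueError because X contains NaN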
Let's figure out how to deal with them.

1. You can drop columns with missing values:

data_without_missing_values = original_data.dropna(axis=1)

If you want to drop the same columns from the DataFrames in both your training dataset and test dataset:

cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

This method discards all information in the entire column, so it can be useful when most values in a column are missing.
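
For example, here is a rough sketch of dropping only the columns that are mostly empty; the 50% cutoff is an arbitrary choice made up for this sketch, not part of the approach above:

# Keep a column only if at least half of its values are present:
mostly_complete_data = original_data.dropna(axis=1, thresh=len(original_data) // 2)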

2. You can impute missing values:

Imputation replaces the missing value with some number (the mean, for example), which usually gives more accurate models than dropping the column entirely.

from sklearn.preprocessing import Imputer

my_imputer = Imputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
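
Note that Imputer was removed from scikit-learn in later releases; on a recent version, the equivalent (a minimal sketch) uses SimpleImputer from sklearn.impute:

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()  # imputes the column mean by default
data_with_imputed_values = my_imputer.fit_transform(original_data)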

Imputation can also be included in a scikit-learn Pipeline, which simplifies model building, validation, and deployment.
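
Here is a minimal sketch of that idea; the toy predictors and target are made up for illustration, and the Imputer/RandomForestRegressor pairing mirrors what this tutorial uses elsewhere (swap in SimpleImputer on newer scikit-learn releases):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer

# Toy data with a missing value:
X = pd.DataFrame({'rooms': [2.0, 3.0, None, 4.0], 'area': [80.0, 120.0, 95.0, 150.0]})
y = pd.Series([200000, 320000, 250000, 400000])

# The imputer runs inside the pipeline, so it is fit only on the data passed to fit().
my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
my_pipeline.fit(X, y)
predictions = my_pipeline.predict(X)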

3. You can extend imputation to consider which values were originally missing:

Imputation is the standard approach, and it usually works well.
However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset).
Or rows with missing values may be unique in some other way.
In that case, your model would make better predictions by considering which values were originally missing.
Here's how it might look:

# Make a copy to keep the original data intact:
new_data = original_data.copy()

# Make new columns indicating what will be imputed:
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + ' was missing'] = new_data[col].isnull()

# Impute:
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)

This approach may or may not improve the results compared to simply imputing values.

An example comparing these approaches using the Melbourne Housing data.

We will predict housing prices from the Melbourne Housing data.


In [20]:
import pandas as pd

mb_data = pd.read_csv('input/melbourne_data.csv')

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

mb_target = mb_data.Price
mb_predictors = mb_data.drop(['Price'], axis=1)

# In order to simplify this example, only numeric predictors are used.
mb_numeric_predictors = mb_predictors.select_dtypes(exclude=['object'])

Create a function to measure how well each approach performs:

We divide our data into training and test sets, and define a function score_dataset(X_train, X_test, y_train, y_test) to compare the quality of different approaches to missing values.
This function reports the out-of-sample MAE score from a RandomForest.


In [22]:
X_train, X_test, y_train, y_test = train_test_split(mb_numeric_predictors, mb_target, train_size=0.7,
                                                   test_size=0.3, random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mae(y_test, preds)

Dropping columns with missing values:


In [23]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error after dropping columns with missing values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))


Mean Absolute Error after dropping columns with missing values:
349491.40597934404

Get model score from imputation:


In [24]:
from sklearn.preprocessing import Imputer

my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error after imputing misssing values:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))


Mean Absolute Error after imputing missing values:
204969.41964123936

Get the score from imputation with extra columns showing what was imputed:


In [25]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# https://www.python.org/dev/peps/pep-0289/
cols_with_missing = (col for col in X_train.columns if X_train[col].isnull().any())

for col in cols_with_missing:
    imputed_X_train_plus[col + ' was missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + ' was missing'] = imputed_X_test_plus[col].isnull()
    
# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error while tracking imputed values:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))


Mean Absolute Error while tracking imputed values:
203061.2769885849

The difference between plain imputation and imputation with the extension is small here, and both do far better than dropping entire columns.

Categorical data is data that takes only a limited number of values.
For example, if people responded to a survey about which brand of car they owned, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None).
Responses fall into a fixed set of categories.
You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first.
Here we'll show the most popular method for encoding categorical variables.

One hot encoding is the most widespread approach, and it works very well unless your categorical variables take on a large number of values.
You generally won't use it for variables taking more than about a dozen different values.
One hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data.

Suppose the values in the original data are Red, Yellow, and Green.
We create a separate column for each possible value.
Wherever the original value was Red, a 1 is entered into the Red column, and so forth.
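
A quick sketch of this with pandas (the toy Color column is made up for illustration):

import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Yellow', 'Green', 'Red']})
pd.get_dummies(colors)  # produces the binary columns Color_Green, Color_Red, Color_Yellow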

Example

Let's begin at the point where the train_predictors and test_predictors DataFrames are set up.
This data contains the housing characteristics.
You will use them to predict home prices, which are stored in a pandas Series called target.


In [26]:
import pandas as pd

train_data = pd.read_csv('input/train.csv')
test_data = pd.read_csv('input/test.csv')

# Disregard the houses that are missing the target value:
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Keep this simple (and inaccurate, remember) and just drop columns with missing values.
# Try experimenting with different and better ways to go about this.
cols_with_missing = [col for col in train_data.columns if train_data[col].isnull().any()]

# Let's get some predictors chosen now:
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# Cardinality == number of unique values in a column.
# This is a convenient (and arbitrary) way of selecting categorical columns.

low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]

numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]

my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

Pandas assigns a data type (dtype) to each Series (column).
Let's take a look at a random sample of dtypes from the predictor data:


In [27]:
train_predictors.dtypes.sample(10)


Out[27]:
YearBuilt        int64
BsmtHalfBath     int64
MSSubClass       int64
ScreenPorch      int64
MSZoning        object
GarageCars       int64
TotRmsAbvGrd     int64
CentralAir      object
FullBath         int64
OverallCond      int64
dtype: object

The object dtype usually indicates a column has text.
It's common to one-hot encode these "object" columns, since they can't be plugged directly into most models.
With pandas you can use the get_dummies() function for one-hot encoding.


In [28]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_training_predictors[:5]


Out[28]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 8450 7 5 2003 2003 706 0 150 856 ... 0 0 0 1 0 0 0 0 1 0
1 20 9600 6 8 1976 1976 978 0 284 1262 ... 0 0 0 1 0 0 0 0 1 0
2 60 11250 7 5 2001 2002 486 0 434 920 ... 0 0 0 1 0 0 0 0 1 0
3 70 9550 7 5 1915 1970 216 0 540 756 ... 0 0 0 1 1 0 0 0 0 0
4 60 14260 8 5 2000 2000 655 0 490 1145 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 159 columns

Alternatively, you could have dropped the categoricals.
To see how the approaches compare, we can calculate the mean absolute error of models that are built with two alternative sets of predictors.

  1. One-hot encoded categoricals as well as numeric predictors.
  2. Numerical predictors, where we drop categoricals.

One-hot encoding usually helps, but it varies on a case-by-case basis.
In this case, there appears to be little meaningful benefit from using one-hot encoded variables.


In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # Convention is to return a positive MAE score, so multiply by -1.
    return -1 * cross_val_score(RandomForestRegressor(50), X, y,
                                scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])
mae_without_categoricals = get_mae(predictors_without_categoricals, target)
mae_without_categoricals


Out[29]:
18243.417545370303

In [30]:
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)
mae_one_hot_encoded


Out[30]:
17882.902608000048

Applying to Multiple Files

So far, you've one-hot-encoded your training data.
What do you do when you have multiple files (e.g. a test dataset, or some other data that you'd like to make predictions for)?
Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be useless, maybe even misleading.
This could happen if a categorical had a different number of values in the training data vs the test data.
Ensure the test data is encoded in the same manner as the training data with the align command:


In [31]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors, 
                                                                    join='left', axis=1)

In [32]:
final_train[:5]


Out[32]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 8450 7 5 2003 2003 706 0 150 856 ... 0 0 0 1 0 0 0 0 1 0
1 20 9600 6 8 1976 1976 978 0 284 1262 ... 0 0 0 1 0 0 0 0 1 0
2 60 11250 7 5 2001 2002 486 0 434 920 ... 0 0 0 1 0 0 0 0 1 0
3 70 9550 7 5 1915 1970 216 0 540 756 ... 0 0 0 1 1 0 0 0 0 0
4 60 14260 8 5 2000 2000 655 0 490 1145 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 159 columns


In [33]:
final_test[:5]


Out[33]:
MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 20 11622 5 6 1961 1961 468.0 144.0 270.0 882.0 ... 0 0 0 1 0 0 0 0 1 0
1 20 14267 6 6 1958 1958 923.0 0.0 406.0 1329.0 ... 0 0 0 1 0 0 0 0 1 0
2 60 13830 5 5 1997 1998 791.0 0.0 137.0 928.0 ... 0 0 0 1 0 0 0 0 1 0
3 60 9978 6 6 1998 1998 602.0 0.0 324.0 926.0 ... 0 0 0 1 0 0 0 0 1 0
4 120 5005 8 5 1992 1992 263.0 0.0 1017.0 1280.0 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 159 columns

The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset.)
The argument join='left' specifies that we will do the equivalent of SQL's left join.
That means, if there are ever columns that show up in one dataset and not the other, we will keep the columns from our training data.
The argument join='inner' would do what SQL databases call an inner join, keeping only the columns showing up in both datasets.
That can also be a sensible choice.
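
For instance, here is a minimal sketch of the inner-join variant, reusing the encoded DataFrames built above:

final_train_inner, final_test_inner = one_hot_encoded_training_predictors.align(
    one_hot_encoded_test_predictors, join='inner', axis=1)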

Conclusion

The world is filled with categorical data.
You will be a much more effective data scientist if you know how to use this data.
Here are resources that will be useful as you start doing more sophisticated work with categorical data.

  • Pipelines: Deploying models into production-ready systems is a topic unto itself. While one-hot encoding is still a great approach, your code will need to be built in an especially robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding, and this can be added to a Pipeline (see the sketch after this list). Unfortunately, it doesn't handle text or object values, which is a common use case.
  • Applications To Text for Deep Learning: Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.
  • Categoricals with Many Values: Scikit-learn's FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.
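
Regarding the Pipelines bullet above: in newer scikit-learn releases (0.22 and later), OneHotEncoder does accept string/object columns directly and can be combined with a ColumnTransformer. The following is a minimal sketch under that assumption, reusing low_cardinality_cols, train_predictors, and target from earlier in this notebook; it is not the exact approach used above.

from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the low-cardinality object columns; pass the numeric columns through untouched.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), low_cardinality_cols),
    remainder='passthrough')

# Bundling preprocessing with the model keeps the encoding consistent across training, validation, and prediction.
my_pipeline = make_pipeline(preprocessor, RandomForestRegressor())
my_pipeline.fit(train_predictors, target)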