In [18]:
def foo():
    y = 'cuz_i'
    return y

foo()
Out[18]:
In [19]:
def bar():
    y = 'said_so'
    return y

bar()
Out[19]:
If you aren't careful about variable scope, confusion may ensue; use caution.
Now, on to missing values.
There are many ways data can end up with missing values.
Most Python libraries (including pandas) represent missing numbers as nan, which is short for "not a number".
You can detect which cells have missing values, and then count how many there are in each column with the command:
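(A sketch; data here stands in for whatever DataFrame you have loaded.)
missing_val_count_by_column = data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])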
Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.
Let's figure out how to deal with them.
If you want to drop the same columns from the DataFrames in both your training dataset and test dataset:
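(A rough sketch; original_data and test_data are placeholder names for your two DataFrames.)
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)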
This method discards all information in the entire column, so it can be useful when most values in a column are missing.
Imputation replaces the missing value with some number (the mean, for example), which usually gives more accurate models than dropping the column entirely.
Imputation can also be included in a scikit-learn Pipeline, which simplifies model building, validation, and deployment.
Imputation is the standard approach, and it usually works well.
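As a rough sketch, using the same Imputer class the cells further down use (original_data is a placeholder name):
from sklearn.preprocessing import Imputer

my_imputer = Imputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)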
However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset).
Or rows with missing values may be unique in some other way.
In that case, your model would make better predictions by considering which values were originally missing.
Here's how it might look:
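One possible sketch, again with placeholder names (the housing example below does the same thing on real data):
from sklearn.preprocessing import Imputer

# Make a copy to avoid changing the original data when imputing.
new_data = original_data.copy()

# Make new columns indicating which values will be imputed.
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + ' was missing'] = new_data[col].isnull()

# Impute (by default the Imputer fills in the column mean).
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)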
This approach may or may not improve the results compared to simply imputing values.
We will see an example predicting housing prices from the Melbourne Housing data.
In [20]:
import pandas as pd
mb_data = pd.read_csv('input/melbourne_data.csv')
In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split
mb_target = mb_data.Price
mb_predictors = mb_data.drop(['Price'], axis=1)
# In order to simplify this example, only numeric predictors are used.
mb_numeric_predictors = mb_predictors.select_dtypes(exclude=['object'])
We divide our data into training and test.
Below we define a function score_dataset(X_train, X_test, y_train, y_test)
to compare the quality of different approaches to missing values.
This function reports the out-of-sample MAE score from a RandomForest.
In [22]:
X_train, X_test, y_train, y_test = train_test_split(mb_numeric_predictors, mb_target, train_size=0.7,
                                                    test_size=0.3, random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mae(y_test, preds)
In [23]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error after dropping columns with missing values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
In [24]:
from sklearn.preprocessing import Imputer
my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error after imputing misssing values:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
In [25]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
# https://www.python.org/dev/peps/pep-0289/
cols_with_missing = (col for col in X_train.columns if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + ' was missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + ' was missing'] = imputed_X_test_plus[col].isnull()
# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
print("Mean Absolute Error while tracking imputed values:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
In this case, the difference between plain imputation and imputation with the extension is minor compared with how much worse dropping entire columns performs.
Categorical data is data that takes only a limited number of values.
For example, if people responded to a survey about which brand of car they owned, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None).
Responses fall into a fixed set of categories.
You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first.
Here we'll show the most popular method for encoding categorical variables.
One hot encoding is the most widespread approach, and it works very well unless your categorical variables take on a large number of values.
You generally won't use it for variables taking more than about a dozen different values.
One hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data.
The values in the original data are Red, Yellow, and Green.
We create a separate column for each possible value.
Wherever the original value was Red, a 1 is entered into the Red column, and so forth.
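As a quick toy illustration with pandas (the Color column here is made-up data, not part of the housing set):
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Red', 'Yellow', 'Green', 'Yellow']})
print(pd.get_dummies(colors))

This produces one binary column each for Color_Green, Color_Red, and Color_Yellow.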
Let's begin at the point where the train_predictors and test_predictors DataFrames have been set up.
This data contains the housing characteristics.
You will use them to predict home prices, which are stored in a pandas Series called target.
In [26]:
import pandas as pd
train_data = pd.read_csv('input/train.csv')
test_data = pd.read_csv('input/test.csv')
# Disregard the houses that are missing the target value:
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
target = train_data.SalePrice
# Keep this simple (and inaccurate, remember) and just drop columns with missing values.
# Try experimenting with different and better ways to go about this.
cols_with_missing = [col for col in train_data.columns if train_data[col].isnull().any()]
# Let's get some predictors chosen now:
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)
# Cardinality == number of unique values in a column.
# This is a convenient (and arbitrary) way of selecting categorical columns.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if
                        candidate_train_predictors[cname].nunique() < 10 and
                        candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns if
                candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]
Pandas assigns a data type (dtype) to each Series (column).
Let's take a look at a random sample of dtypes from the prediction data:
In [27]:
train_predictors.dtypes.sample(10)
Out[27]:
The object dtype usually indicates a column has text.
It's common to one-hot encode these "object" columns, since they can't be plugged directly into most models.
With pandas you can use the get_dummies()
function for one-hot encoding.
In [28]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_training_predictors[:5]
Out[28]:
Alternatively, you could have dropped the categoricals.
To see how the approaches compare, we can calculate the mean absolute error of models that are built with two alternative sets of predictors.
One-hot encoding usually helps, but it varies on a case-by-case basis.
In this case, there appears to be little meaningful benefit from using one-hot encoded variables.
In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
def get_mae(X, y):
    # Convention is to return a positive MAE score, so multiply by -1.
    return -1 * cross_val_score(RandomForestRegressor(50), X, y,
                                scoring='neg_mean_absolute_error').mean()
predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])
mae_without_categoricals = get_mae(predictors_without_categoricals, target)
mae_without_categoricals
Out[29]:
In [30]:
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)
mae_one_hot_encoded
Out[30]:
So far, you've one-hot-encoded your training data.
What do you do when you have multiple files (e.g. a test dataset, or some other data that you'd like to make predictions for)?
Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be useless, or even misleading.
This could happen if a categorical variable takes on a different set of values in the training data than in the test data.
Ensure the test data is encoded in the same manner as the training data with the align
command:
In [31]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                     join='left', axis=1)
In [32]:
final_train[:5]
Out[32]:
In [33]:
final_test[:5]
Out[33]:
The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset).
The argument join='left'
specifies that we will do the equivalent of SQL's left join.
That means, if there are ever columns that show up in one dataset and not the other, we will keep the columns from our training data.
The argument join='inner'
would do what SQL databases call an inner join, keeping only the columns showing up in both datasets.
That can also be a sensible choice.
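For instance, the inner-join variant of the same call (a sketch using the DataFrames from above) would be:
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                     join='inner', axis=1)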
The world is filled with categorical data.
You will be a much more effective data scientist if you know how to use this data.
Here are resources that will be useful as you start doing more sophisticated work with categorical data.
FeatureHasher
uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.
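A minimal sketch of the hashing trick on made-up category values (not the housing data):
from sklearn.feature_extraction import FeatureHasher

# Each sample is an iterable of strings; each string is hashed into one of n_features columns.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed_features = hasher.transform([['Honda'], ['Toyota'], ['Ford'], ['None']])
print(hashed_features.toarray())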