In [1]:
#!/usr/bin/env python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')
In [2]:
## create some data
X,y = make_classification(n_samples=50, n_features=5)
print(X.mean(axis=0))
In [4]:
## make a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y)
## scale using sklearn
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_1 = scaler.transform(X_train)
X_test_1 = scaler.transform(X_test)
## scale without sklearn (just so you know what is happening)
X_train_2 = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_test_2 = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
## are they equal?
print(np.array_equal(X_train_1.mean(axis=0),X_train_2.mean(axis=0)))
print(np.array_equal(X_test_1.mean(axis=0),X_test_2.mean(axis=0)))
Missing data is a common problem in most real-world scientific datasets. While the best way for dealing with missing data will always be preventing their occurrence in the first place, it usually can't be helped, particularly when data are collected passively or voluntarily, or when data collection and recording is distributed among several people. There are a variety of ways for dealing with missing data, from the very naïve to the very sophisticated, and unfortunately the more common approaches tend to be ad hoc and will usually do more harm than good.
It turns out that more robust methods for imputation are not as difficult to implement as they first appear to be. Two of the best ones are Bayesian imputation and multiple imputation. In this section, we will use multiple imputation to account for missing data in a regression analysis.
As a motivating example, we will use a dataset of educational outcomes for children with hearing impairment. Here, we are interested in determining factors that are associated with better or poorer learning outcomes.
There is a suite of available predictors, including:
male
)siblings
)family_inv
)non_english
)prev_disab
)non_white
)age_test
)non_severe_hl
)mother_hs
)early_ident
).
In [7]:
test_scores = pd.read_csv('test_scores.csv', index_col=0)
test_scores.head()
Out[7]:
In [8]:
test_scores.info()
In [9]:
test_scores.isnull().sum(0)
Out[9]:
The easiest (and worst) way to deal with missing data is to ignore it. That is, simply run the analysis, missing values and all, hoping for the best. If your software is any good, this approach will simply not work; the algorithm will try to operate on data that includes missing values, and propagate them, resulting in statistics and estimates that cannot be calculated, which will typically raise errors. If your software is poor, it will make some assumption or decision about the missing values, and proceed to generate results conditional on the assumption, which creates problems that may never be detected because no indication was given to any potential problem.
The next easiest (worst) approach to analyzing data with missing values is to conduct list-wise deletion, by deleting the records that have missing values. This is called complete case analysis, because only records that are complete get retained for the analysis. The degree to which complete case analysis is undesirable depends on the mechanism by which data have become missing.
Think about Bias and Power
The very best-case scenario for using complete case analysis, which corresponds to MCAR missingness, results in a loss of power due to the reduction in sample size. The analysis will lose the information contained in the non-missing elements of a partially-missing record. When data are not missing completely at random, inferences from complete case analysis may be biased due to systematic differences between missing and non-missing records that affects the estimates of key parameters.
One alternative to complete case analysis is to simply fill (impute) the missing values with a reasonable guess a the true value, such as the mean, median or modal value of the fully-observed records. This imputation, while not recovering any information regarding the missing value itself for use in the analysis, does provide a mechanism for including the non-missing values in the analysis, thereby making use of all available information.
The Imputer
class in scikit-learn provides methods for imputing missing values, either using the mean, the median or the most frequent value of the row or column in which the missing values are located. This class also allows for different missing value encodings.
For example, we can replace missing entries encoded as np.nan
using the mean value of the columns (axis 0) that contain the missing values:
In [12]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
simple_mat = [[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]
print(simple_mat)
imp_mean.fit(simple_mat)
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))
In [ ]:
In [ ]:
In [17]:
## use mode imputation for test_scores
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mode.fit(test_scores)
print(imp_mode.transform(test_scores))
some times it is easier to use the fillna
method
In [18]:
test_scores.siblings.mean()
Out[18]:
In [19]:
siblings_imputed = test_scores.siblings.fillna(test_scores.siblings.mean())
This approach may be reasonable under the MCAR assumption, but may induce bias under a MAR scenario, whereby missing values may differ systematically relative to non-missing values, making the particular summary statistic used for imputation biased as a mean/median/modal value for the missing values.
Beyond this, the use of a single imputed value to stand in place of the actual missing value glosses over the uncertainty associated with this guess at the true value. Any subsequent analysis procedure (e.g. regression analysis) will behave as if the imputed value were observed, despite the fact that we are actually unsure of the actual value for the missing variable. The practical consequence of this is that the variance of any estimates resulting from the imputed dataset will be artificially reduced.
One robust alternative to addressing missing data is multiple imputation (Schaffer 1999, van Buuren 2012). It produces unbiased parameter estimates, while simultaneously accounting for the uncertainty associated with imputing missing values. It is conceptually and mechanistically straightforward, and produces complete datasets that may be analyzed using any statistical methodology or software one chooses, as if the data had no missing values to begin with.
Multiple imputation generates imputed values based on a regression model. This regression model will help us generate reasonable values, particularly if data are MAR, since it uses information in the dataset that may be informative in predicting what the true value may be. Ideally, we want predictor variables that are correlated with the missing variable, and with the mechanism of missingness, if any. For example, one might be able to use test scores from one subject to predict missing test scores from another; or, the probability of income reporting to be missing may vary systematically according to the education level of the individual.
To see if there is any potential information among the variables in our dataset to use for imputation, it is helpful to calculate the pairwise correlation between all the variables. Since we have discrete variables in our data, the Spearman rank correlation coefficient is appropriate.
In [20]:
test_scores.dropna().corr(method='spearman')
Out[20]:
We will try to impute missing values the mother's high school education indicator variable, which takes values of 0 for no high school diploma, or 1 for high school diploma or greater. The appropriate model to predict binary variables is a logistic regression. We will use the scikit-learn implementation, LogisticRegression
.
In [21]:
from sklearn.linear_model import LogisticRegression
To keep things simple, we will only use variables that are themselves complete to build the predictive model, hence our subset of predictors will exclude family involvement score (family_inv
) and previous disability (prev_disab
).
In [22]:
impute_subset = test_scores.drop(labels=['family_inv','prev_disab','score'], axis=1)
Next, we scale the predictor variables to range from 0 to 1, to improve the performance of the regression model.
In [23]:
y = impute_subset.pop('mother_hs').values
X = preprocessing.StandardScaler().fit_transform(impute_subset.astype(float))
Next, we create a LogisticRegression
model, and fit it using the non-missing observations.
In [24]:
missing = np.isnan(y)
mod = LogisticRegression()
mod.fit(X[~missing], y[~missing])
Out[24]:
In [25]:
mother_hs_pred = mod.predict(X[missing])
mother_hs_pred
Out[25]:
These values can then be inserted in place of the missing values, and an analysis can be performed on the entire dataset.
However, this is still just a single imputation for each missing value, and hence glosses over the uncertainty associated with the derivation of the imputes. Multiple imputation proceeds by imputing several values, to generate several complete datasets and performing the same analysis on all of them. With a set of estimates in hand, an average estimate of model parameters can be obtained that more adequately accounts for the uncertainty, hopefully providing more robust inference than from a single impute.
There are a variety of ways to generate multiple imputations. Here, we will exploit regularization in order to do this. The LogisticRegression
class from scikit-learn provides facilities for regularization using either L2 (resulting in ridge regression) or L1 (resulting in LASSO regression) penalties. The degree of regularization in either case is controlled by the C
parameter, whereby large values of C
give more freedom to the model, while smaller values of C
constrain the model more. We can use a selection of C
values to obtain a range of predictions from variants of the same model. For example:
In [32]:
mod2 = LogisticRegression(C=1, penalty='l1')
mod2.fit(X[~missing], y[~missing])
mod2.predict(X[missing])
Out[32]:
In [33]:
mod3 = LogisticRegression(C=0.4, penalty='l1')
mod3.fit(X[~missing], y[~missing])
mod3.predict(X[missing])
Out[33]:
Surprisingly few imputations are required to acheive reasonable estimates, with 3-10 usually sufficient. We will use 3.
In [34]:
mother_hs_imp = []
for C in 0.1, 0.4, 2:
mod = LogisticRegression(C=C, penalty='l1')
mod.fit(X[~missing], y[~missing])
imputed = mod.predict(X[missing])
mother_hs_imp.append(imputed)
In [35]:
mother_hs_imp
Out[35]:
In [ ]: