Missing value imputation

Preprocessing data

Real world day often contains missing values. As part of the preprocessing step one common task is to handle missing values. This is a summary of the things to consider when preprocessing data.


In [1]:
#!/usr/bin/env python

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

In [2]:
## create some data
X,y = make_classification(n_samples=50, n_features=5)
print(X.mean(axis=0))


[0.01186709 0.07328357 0.10226158 0.16595221 0.16894656]

Scaling


In [4]:
## make a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

## scale using sklearn
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_1 = scaler.transform(X_train)
X_test_1 = scaler.transform(X_test)

## scale without sklearn (just so you know what is happening)
X_train_2 = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_test_2 = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

## are they equal?
print(np.array_equal(X_train_1.mean(axis=0),X_train_2.mean(axis=0)))
print(np.array_equal(X_test_1.mean(axis=0),X_test_2.mean(axis=0)))


True
True

Missing Data Imputation

Missing data is a common problem in most real-world scientific datasets. While the best way for dealing with missing data will always be preventing their occurrence in the first place, it usually can't be helped, particularly when data are collected passively or voluntarily, or when data collection and recording is distributed among several people. There are a variety of ways for dealing with missing data, from the very naïve to the very sophisticated, and unfortunately the more common approaches tend to be ad hoc and will usually do more harm than good.

It turns out that more robust methods for imputation are not as difficult to implement as they first appear to be. Two of the best ones are Bayesian imputation and multiple imputation. In this section, we will use multiple imputation to account for missing data in a regression analysis.

As a motivating example, we will use a dataset of educational outcomes for children with hearing impairment. Here, we are interested in determining factors that are associated with better or poorer learning outcomes.

There is a suite of available predictors, including:

  • gender (male)
  • number of siblings in the household (siblings)
  • index of family involvement (family_inv)
  • whether the primary household language is not English (non_english)
  • presence of a previous disability (prev_disab)
  • non-white race (non_white)
  • age at the time of testing (in months, age_test)
  • whether hearing loss is not severe (non_severe_hl)
  • whether the subject's mother obtained a high school diploma or better (mother_hs)
  • whether the hearing impairment was identified by 3 months of age (early_ident).

In [7]:
test_scores = pd.read_csv('test_scores.csv', index_col=0)
test_scores.head()


Out[7]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
0 40 0 2.0 2.0 False NaN 55 1.0 NaN False False
1 31 1 0.0 NaN False 0.0 53 0.0 0.0 False False
2 83 1 1.0 1.0 True 0.0 52 1.0 NaN False True
3 75 0 3.0 NaN False 0.0 55 0.0 1.0 False False
5 62 0 0.0 4.0 False 1.0 50 0.0 NaN False False

In [8]:
test_scores.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 207 entries, 0 to 224
Data columns (total 11 columns):
score            207 non-null int64
male             207 non-null int64
siblings         207 non-null float64
family_inv       174 non-null float64
non_english      207 non-null bool
prev_disab       189 non-null float64
age_test         207 non-null int64
non_severe_hl    207 non-null float64
mother_hs        134 non-null float64
early_ident      207 non-null bool
non_white        207 non-null bool
dtypes: bool(3), float64(5), int64(3)
memory usage: 15.2 KB

In [9]:
test_scores.isnull().sum(0)


Out[9]:
score             0
male              0
siblings          0
family_inv       33
non_english       0
prev_disab       18
age_test          0
non_severe_hl     0
mother_hs        73
early_ident       0
non_white         0
dtype: int64

Strategies for dealing with missing data

Just ignore it

The easiest (and worst) way to deal with missing data is to ignore it. That is, simply run the analysis, missing values and all, hoping for the best. If your software is any good, this approach will simply not work; the algorithm will try to operate on data that includes missing values, and propagate them, resulting in statistics and estimates that cannot be calculated, which will typically raise errors. If your software is poor, it will make some assumption or decision about the missing values, and proceed to generate results conditional on the assumption, which creates problems that may never be detected because no indication was given to any potential problem.

Take only the complete case analysis

The next easiest (worst) approach to analyzing data with missing values is to conduct list-wise deletion, by deleting the records that have missing values. This is called complete case analysis, because only records that are complete get retained for the analysis. The degree to which complete case analysis is undesirable depends on the mechanism by which data have become missing.

Types of Missingness

Think about Bias and Power

  • Missing completely at random (MCAR): When data are MCAR, missing cases are, on average, identical to non-missing cases, with respect to the model. Ignoring the missingness will reduce the power of the analysis, but will not bias inference.
  • Missing at random (MAR): Missing data depends (usually probabilistically) on measured values, and hence can be modeled by variables observed in the data set. Accounting for the values which “cause” the missing data will produce unbiased results in an analysis.
  • Missing not at random(MNAR): Missing data depend on unmeasured or unknown variables. There is no information available to account for the missingness.

The very best-case scenario for using complete case analysis, which corresponds to MCAR missingness, results in a loss of power due to the reduction in sample size. The analysis will lose the information contained in the non-missing elements of a partially-missing record. When data are not missing completely at random, inferences from complete case analysis may be biased due to systematic differences between missing and non-missing records that affects the estimates of key parameters.

One alternative to complete case analysis is to simply fill (impute) the missing values with a reasonable guess a the true value, such as the mean, median or modal value of the fully-observed records. This imputation, while not recovering any information regarding the missing value itself for use in the analysis, does provide a mechanism for including the non-missing values in the analysis, thereby making use of all available information.

The Imputer class in scikit-learn provides methods for imputing missing values, either using the mean, the median or the most frequent value of the row or column in which the missing values are located. This class also allows for different missing value encodings.

For example, we can replace missing entries encoded as np.nan using the mean value of the columns (axis 0) that contain the missing values:


In [12]:
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
simple_mat = [[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]
print(simple_mat)
imp_mean.fit(simple_mat)
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))


[[7, 2, 3], [4, nan, 6], [10, 5, 9]]
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

In [ ]:


In [ ]:


In [17]:
## use mode imputation for test_scores
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mode.fit(test_scores)
print(imp_mode.transform(test_scores))


[[40 0 2.0 ... 1.0 False False]
 [31 1 0.0 ... 0.0 False False]
 [83 1 1.0 ... 1.0 False True]
 ...
 [71 1 1.0 ... 1.0 False True]
 [118 0 2.0 ... 1.0 True False]
 [99 1 2.0 ... 1.0 True False]]

some times it is easier to use the fillna method


In [18]:
test_scores.siblings.mean()


Out[18]:
1.1256038647342994

In [19]:
siblings_imputed = test_scores.siblings.fillna(test_scores.siblings.mean())

This approach may be reasonable under the MCAR assumption, but may induce bias under a MAR scenario, whereby missing values may differ systematically relative to non-missing values, making the particular summary statistic used for imputation biased as a mean/median/modal value for the missing values.

Beyond this, the use of a single imputed value to stand in place of the actual missing value glosses over the uncertainty associated with this guess at the true value. Any subsequent analysis procedure (e.g. regression analysis) will behave as if the imputed value were observed, despite the fact that we are actually unsure of the actual value for the missing variable. The practical consequence of this is that the variance of any estimates resulting from the imputed dataset will be artificially reduced.

Multiple Imputation

One robust alternative to addressing missing data is multiple imputation (Schaffer 1999, van Buuren 2012). It produces unbiased parameter estimates, while simultaneously accounting for the uncertainty associated with imputing missing values. It is conceptually and mechanistically straightforward, and produces complete datasets that may be analyzed using any statistical methodology or software one chooses, as if the data had no missing values to begin with.

Multiple imputation generates imputed values based on a regression model. This regression model will help us generate reasonable values, particularly if data are MAR, since it uses information in the dataset that may be informative in predicting what the true value may be. Ideally, we want predictor variables that are correlated with the missing variable, and with the mechanism of missingness, if any. For example, one might be able to use test scores from one subject to predict missing test scores from another; or, the probability of income reporting to be missing may vary systematically according to the education level of the individual.

To see if there is any potential information among the variables in our dataset to use for imputation, it is helpful to calculate the pairwise correlation between all the variables. Since we have discrete variables in our data, the Spearman rank correlation coefficient is appropriate.


In [20]:
test_scores.dropna().corr(method='spearman')


Out[20]:
score male siblings family_inv non_english prev_disab age_test non_severe_hl mother_hs early_ident non_white
score 1.000000 0.073063 -0.085044 -0.539019 -0.278798 -0.184426 0.024057 0.140305 0.228500 0.222711 -0.345061
male 0.073063 1.000000 -0.072006 -0.008714 0.053338 -0.052054 -0.081165 0.031825 0.050372 -0.007690 -0.048638
siblings -0.085044 -0.072006 1.000000 0.078471 -0.049989 -0.038020 0.104905 -0.003689 0.096268 0.077318 0.006234
family_inv -0.539019 -0.008714 0.078471 1.000000 0.221696 0.082314 -0.029120 -0.092815 -0.358898 0.006370 0.401617
non_english -0.278798 0.053338 -0.049989 0.221696 1.000000 -0.021996 0.068095 -0.047775 -0.199639 -0.015812 0.225428
prev_disab -0.184426 -0.052054 -0.038020 0.082314 -0.021996 1.000000 0.136604 0.048132 0.137893 0.046592 -0.021367
age_test 0.024057 -0.081165 0.104905 -0.029120 0.068095 0.136604 1.000000 -0.122811 0.016760 0.033789 0.068430
non_severe_hl 0.140305 0.031825 -0.003689 -0.092815 -0.047775 0.048132 -0.122811 1.000000 -0.015996 0.008211 0.028480
mother_hs 0.228500 0.050372 0.096268 -0.358898 -0.199639 0.137893 0.016760 -0.015996 1.000000 0.024411 -0.214209
early_ident 0.222711 -0.007690 0.077318 0.006370 -0.015812 0.046592 0.033789 0.008211 0.024411 1.000000 -0.022854
non_white -0.345061 -0.048638 0.006234 0.401617 0.225428 -0.021367 0.068430 0.028480 -0.214209 -0.022854 1.000000

We will try to impute missing values the mother's high school education indicator variable, which takes values of 0 for no high school diploma, or 1 for high school diploma or greater. The appropriate model to predict binary variables is a logistic regression. We will use the scikit-learn implementation, LogisticRegression.


In [21]:
from sklearn.linear_model import LogisticRegression

To keep things simple, we will only use variables that are themselves complete to build the predictive model, hence our subset of predictors will exclude family involvement score (family_inv) and previous disability (prev_disab).


In [22]:
impute_subset = test_scores.drop(labels=['family_inv','prev_disab','score'], axis=1)

Next, we scale the predictor variables to range from 0 to 1, to improve the performance of the regression model.


In [23]:
y = impute_subset.pop('mother_hs').values
X = preprocessing.StandardScaler().fit_transform(impute_subset.astype(float))

Next, we create a LogisticRegression model, and fit it using the non-missing observations.


In [24]:
missing = np.isnan(y)
mod = LogisticRegression()
mod.fit(X[~missing], y[~missing])


/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[24]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [25]:
mother_hs_pred = mod.predict(X[missing])
mother_hs_pred


Out[25]:
array([1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       0., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1.,
       0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1.])

These values can then be inserted in place of the missing values, and an analysis can be performed on the entire dataset.

However, this is still just a single imputation for each missing value, and hence glosses over the uncertainty associated with the derivation of the imputes. Multiple imputation proceeds by imputing several values, to generate several complete datasets and performing the same analysis on all of them. With a set of estimates in hand, an average estimate of model parameters can be obtained that more adequately accounts for the uncertainty, hopefully providing more robust inference than from a single impute.

There are a variety of ways to generate multiple imputations. Here, we will exploit regularization in order to do this. The LogisticRegression class from scikit-learn provides facilities for regularization using either L2 (resulting in ridge regression) or L1 (resulting in LASSO regression) penalties. The degree of regularization in either case is controlled by the C parameter, whereby large values of C give more freedom to the model, while smaller values of C constrain the model more. We can use a selection of C values to obtain a range of predictions from variants of the same model. For example:


In [32]:
mod2 = LogisticRegression(C=1, penalty='l1')
mod2.fit(X[~missing], y[~missing])
mod2.predict(X[missing])


Out[32]:
array([ 1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,
        1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.])

In [33]:
mod3 = LogisticRegression(C=0.4, penalty='l1')
mod3.fit(X[~missing], y[~missing])
mod3.predict(X[missing])


Out[33]:
array([ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,
        1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

Surprisingly few imputations are required to acheive reasonable estimates, with 3-10 usually sufficient. We will use 3.


In [34]:
mother_hs_imp = []

for C in 0.1, 0.4, 2:
    
    mod = LogisticRegression(C=C, penalty='l1')
    mod.fit(X[~missing], y[~missing])
    imputed = mod.predict(X[missing])
    mother_hs_imp.append(imputed)

In [35]:
mother_hs_imp


Out[35]:
[array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 array([ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
         0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,
         1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
         1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 array([ 1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
         0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,
         1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
         1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.])]

In [ ]: