In [1]:
### Example taken from :: https://machinelearningmastery.com/handle-missing-data-python/
1. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

1. Number of times pregnant.
2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Class variable (0 or 1).

In [2]:
import pandas as pd

Out[2]:
0 1 2 3 4 5 6 7 8
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

In [3]:
print(dataset.describe())

0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000

6           7           8
count  768.000000  768.000000  768.000000
mean     0.471876   33.240885    0.348958
std      0.331329   11.760232    0.476951
min      0.078000   21.000000    0.000000
25%      0.243750   24.000000    0.000000
50%      0.372500   29.000000    0.000000
75%      0.626250   41.000000    1.000000
max      2.420000   81.000000    1.000000

This is useful.

We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

1: Plasma glucose concentration 2: Diastolic blood pressure 3: Triceps skinfold thickness 4: 2-Hour serum insulin 5: Body mass index

###### #

We can get a count of the number of missing values on each of these columns. We can do this my marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of true values in each column.

We can do this my marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of true values in each column.

In [4]:
print((dataset[[1,2,3,4,5]]== 0).sum())

1      5
2     35
3    227
4    374
5     11
dtype: int64

In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.

Values with a NaN value are ignored from operations like sum, count, etc.

We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.

After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.

In [6]:
import numpy as np
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.NaN )

# count the number of NaN values in each column
print(dataset.isnull().sum())

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64
1. Missing Values Causes Problems

Having missing values in a dataset can cause errors with some machine learning algorithms.

In this section, we will try to evaluate a the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.

This is an algorithm that does not work when there are missing values in the dataset.

The below example marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross validation and print the mean accuracy.

In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# split dataset into inputs and outputs
values = dataset.values   ## generated matrix out of the dataframe
X = values[:,:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()

kfold = KFold(n_splits =3 , random_state = 7)
result = cross_val_score(model,X, y, cv= kfold, scoring = 'accuracy' )
print('result.mean():', result.mean())

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-8def58b992d1> in <module>()
11
12 kfold = KFold(n_splits =3 , random_state = 7)
---> 13 result = cross_val_score(model,X, y, cv= kfold, scoring = 'accuracy' )
14 print('result.mean():', result.mean())

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
138                                               train, test, verbose, None,
139                                               fit_params)
--> 140                       for train, test in cv_iter)
141     return np.array(scores)[:, 0]
142

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
756             # was dispatched. In particular this covers the edge
757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
759                 self._iterating = True
760             else:

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
606                 return False
607             else:
609                 return True
610

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
569         dispatch_timestamp = time.time()
570         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571         job = self._backend.apply_async(batch, callback=cb)
572         self._jobs.append(job)
573

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
107     def apply_async(self, func, callback=None):
108         """Schedule a func to be run"""
--> 109         result = ImmediateResult(func)
110         if callback:
111             callback(result)

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
324         # Don't delay the application, to avoid keeping the input
325         # arguments in memory
--> 326         self.results = batch()
327
328     def get(self):

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
129
130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133     def __len__(self):

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
129
130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133     def __len__(self):

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
236             estimator.fit(X_train, **fit_params)
237         else:
--> 238             estimator.fit(X_train, y_train, **fit_params)
239
240     except Exception as e:

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py in fit(self, X, y, store_covariance, tol)
440                           DeprecationWarning)
441             self.tol = tol
--> 442         X, y = check_X_y(X, y, ensure_min_samples=2, estimator=self)
443         self.classes_ = unique_labels(y)
444

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
522     if multi_output:
523         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
405                              % (array.ndim, estimator_name))
406         if force_all_finite:
--> 407             _assert_all_finite(array)
408
409     shape_repr = _shape_repr(array.shape)

C:\Users\prassha\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X)
56             and not np.isfinite(X).all()):
57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
59
60

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This is as we expect.

We are prevented from evaluating an LDA algorithm (and other algorithms) on the dataset with missing values.

Now, we can look at methods to handle the missing values.

###### #
1. Remove Rows With Missing Values

The simplest strategy for handling missing data is to remove records that contain a missing value.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:

In [10]:
# drop rows with missing values
dataset.dropna(inplace = True)
# summarize the number of rows and columns in the dataset
print(dataset.shape)

(392, 9)

We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.

In [11]:
values = dataset.values   ## generated matrix out of the dataframe
X = values[:,:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()

kfold = KFold(n_splits =3 , random_state = 7)
result = cross_val_score(model,X, y, cv= kfold, scoring = 'accuracy' )
print('result.mean():', result.mean())

result.mean(): 0.78582892934

Removing rows with missing values can be too limiting on some predictive modeling problems, an alternative is to impute missing values.

In [13]:
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.NaN)

# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace = True)

# count the number of NaN values in each column
print(dataset.isnull().sum())

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64

**The scikit-learn library provides the Imputer() pre-processing class that can be used to replace missing values.

It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the technique used to replace it (such as mean, median, or mode). The Imputer class operates directly on the NumPy array instead of the DataFrame.

The example below uses the Imputer class to replace missing values with the mean of each column then prints the number of NaN values in the transformed matrix. **

In [18]:
from sklearn.preprocessing import Imputer
#values = dataset.values   ## generated matrix out of the dataframe
#X = values[:,:8]
#y = values[:,8]
imp = Imputer(missing_values = 'NaN', strategy = 'mean', axis =0)
imp.fit(dataset)   ######## working
print(dataset.isnull().sum())  ###### working

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64

In [27]:
import pandas as pd

y = df[8]
X = df.drop(8, axis =1)

#print((X[[1,2,3,4,5]]== 0).sum())   # works
print((X == 0).sum()) # also works
imp = Imputer(missing_values = 0, strategy = 'mean', axis =0)
imp.fit(X)   ######## working
print((X == 0).sum())  ###### working

0    111
1      5
2     35
3    227
4    374
5     11
6      0
7      0
dtype: int64
0    111
1      5
2     35
3    227
4    374
5     11
6      0
7      0
dtype: int64
Out[27]:
0 1 2 3 4 5 6 7
0 6 148 72 35 0 33.6 0.627 50
1 1 85 66 29 0 26.6 0.351 31
2 8 183 64 0 0 23.3 0.672 32
3 1 89 66 23 94 28.1 0.167 21
4 0 137 40 35 168 43.1 2.288 33

In [34]:
import pandas as pd

y = df[8]
X = df.drop(8, axis =1)

print((X == 0).sum()) # also works
imp = Imputer(missing_values = 0, strategy = 'mean', axis =0)
X_imp = imp.fit_transform(X)
print((X_imp == 0).sum())

0    111
1      5
2     35
3    227
4    374
5     11
6      0
7      0
dtype: int64
0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-bb4eaf556e48> in <module>()
9 X_imp = imp.fit_transform(X)
10 print((X_imp == 0).sum())

AttributeError: 'numpy.ndarray' object has no attribute 'head'

In [ ]:
'''
https://stackoverflow.com/questions/33660836/impute-entire-dataframe-all-columns-using-scikit-learn-sklearn-without-itera
If you want the mean or median you could do something like:

fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
imputed_DF = pd.DataFrame(fill_NaN.fit_transform(DF))
imputed_DF.columns = DF.columns
imputed_DF.index = DF.index
If you want to fill them with 0s or something you could always just do:

DF[DF.isnull()] = 0
'''

In [40]:
import pandas as pd

#values = df.values
#X = values[:,0:8]
#y = values[:,8]
y = df[8]
X = df.drop(8, axis =1)
print(X)

0    1   2   3    4     5      6   7
0     6  148  72  35    0  33.6  0.627  50
1     1   85  66  29    0  26.6  0.351  31
2     8  183  64   0    0  23.3  0.672  32
3     1   89  66  23   94  28.1  0.167  21
4     0  137  40  35  168  43.1  2.288  33
5     5  116  74   0    0  25.6  0.201  30
6     3   78  50  32   88  31.0  0.248  26
7    10  115   0   0    0  35.3  0.134  29
8     2  197  70  45  543  30.5  0.158  53
9     8  125  96   0    0   0.0  0.232  54
10    4  110  92   0    0  37.6  0.191  30
11   10  168  74   0    0  38.0  0.537  34
12   10  139  80   0    0  27.1  1.441  57
13    1  189  60  23  846  30.1  0.398  59
14    5  166  72  19  175  25.8  0.587  51
15    7  100   0   0    0  30.0  0.484  32
16    0  118  84  47  230  45.8  0.551  31
17    7  107  74   0    0  29.6  0.254  31
18    1  103  30  38   83  43.3  0.183  33
19    1  115  70  30   96  34.6  0.529  32
20    3  126  88  41  235  39.3  0.704  27
21    8   99  84   0    0  35.4  0.388  50
22    7  196  90   0    0  39.8  0.451  41
23    9  119  80  35    0  29.0  0.263  29
24   11  143  94  33  146  36.6  0.254  51
25   10  125  70  26  115  31.1  0.205  41
26    7  147  76   0    0  39.4  0.257  43
27    1   97  66  15  140  23.2  0.487  22
28   13  145  82  19  110  22.2  0.245  57
29    5  117  92   0    0  34.1  0.337  38
..   ..  ...  ..  ..  ...   ...    ...  ..
738   2   99  60  17  160  36.6  0.453  21
739   1  102  74   0    0  39.5  0.293  42
740  11  120  80  37  150  42.3  0.785  48
741   3  102  44  20   94  30.8  0.400  26
742   1  109  58  18  116  28.5  0.219  22
743   9  140  94   0    0  32.7  0.734  45
744  13  153  88  37  140  40.6  1.174  39
745  12  100  84  33  105  30.0  0.488  46
746   1  147  94  41    0  49.3  0.358  27
747   1   81  74  41   57  46.3  1.096  32
748   3  187  70  22  200  36.4  0.408  36
749   6  162  62   0    0  24.3  0.178  50
750   4  136  70   0    0  31.2  1.182  22
751   1  121  78  39   74  39.0  0.261  28
752   3  108  62  24    0  26.0  0.223  25
753   0  181  88  44  510  43.3  0.222  26
754   8  154  78  32    0  32.4  0.443  45
755   1  128  88  39  110  36.5  1.057  37
756   7  137  90  41    0  32.0  0.391  39
757   0  123  72   0    0  36.3  0.258  52
758   1  106  76   0    0  37.5  0.197  26
759   6  190  92   0    0  35.5  0.278  66
760   2   88  58  26   16  28.4  0.766  22
761   9  170  74  31    0  44.0  0.403  43
762   9   89  62   0    0  22.5  0.142  33
763  10  101  76  48  180  32.9  0.171  63
764   2  122  70  27    0  36.8  0.340  27
765   5  121  72  23  112  26.2  0.245  30
766   1  126  60   0    0  30.1  0.349  47
767   1   93  70  31    0  30.4  0.315  23

[768 rows x 8 columns]

In [42]:
imp = Imputer(missing_values = 0, strategy = 'mean', axis =0)
X_imp = pd.DataFrame(imp.fit_transform(X))
X_imp.columns = X.columns
X_imp.index = X.index
print((X_imp == 0).sum())
X_imp

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
dtype: int64
Out[42]:
0 1 2 3 4 5 6 7
0 6.000000 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50.0
1 1.000000 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31.0
2 8.000000 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32.0
3 1.000000 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21.0
4 4.494673 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33.0
5 5.000000 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30.0
6 3.000000 78.0 50.000000 32.00000 88.000000 31.000000 0.248 26.0
7 10.000000 115.0 72.405184 29.15342 155.548223 35.300000 0.134 29.0
8 2.000000 197.0 70.000000 45.00000 543.000000 30.500000 0.158 53.0
9 8.000000 125.0 96.000000 29.15342 155.548223 32.457464 0.232 54.0
10 4.000000 110.0 92.000000 29.15342 155.548223 37.600000 0.191 30.0
11 10.000000 168.0 74.000000 29.15342 155.548223 38.000000 0.537 34.0
12 10.000000 139.0 80.000000 29.15342 155.548223 27.100000 1.441 57.0
13 1.000000 189.0 60.000000 23.00000 846.000000 30.100000 0.398 59.0
14 5.000000 166.0 72.000000 19.00000 175.000000 25.800000 0.587 51.0
15 7.000000 100.0 72.405184 29.15342 155.548223 30.000000 0.484 32.0
16 4.494673 118.0 84.000000 47.00000 230.000000 45.800000 0.551 31.0
17 7.000000 107.0 74.000000 29.15342 155.548223 29.600000 0.254 31.0
18 1.000000 103.0 30.000000 38.00000 83.000000 43.300000 0.183 33.0
19 1.000000 115.0 70.000000 30.00000 96.000000 34.600000 0.529 32.0
20 3.000000 126.0 88.000000 41.00000 235.000000 39.300000 0.704 27.0
21 8.000000 99.0 84.000000 29.15342 155.548223 35.400000 0.388 50.0
22 7.000000 196.0 90.000000 29.15342 155.548223 39.800000 0.451 41.0
23 9.000000 119.0 80.000000 35.00000 155.548223 29.000000 0.263 29.0
24 11.000000 143.0 94.000000 33.00000 146.000000 36.600000 0.254 51.0
25 10.000000 125.0 70.000000 26.00000 115.000000 31.100000 0.205 41.0
26 7.000000 147.0 76.000000 29.15342 155.548223 39.400000 0.257 43.0
27 1.000000 97.0 66.000000 15.00000 140.000000 23.200000 0.487 22.0
28 13.000000 145.0 82.000000 19.00000 110.000000 22.200000 0.245 57.0
29 5.000000 117.0 92.000000 29.15342 155.548223 34.100000 0.337 38.0
... ... ... ... ... ... ... ... ...
738 2.000000 99.0 60.000000 17.00000 160.000000 36.600000 0.453 21.0
739 1.000000 102.0 74.000000 29.15342 155.548223 39.500000 0.293 42.0
740 11.000000 120.0 80.000000 37.00000 150.000000 42.300000 0.785 48.0
741 3.000000 102.0 44.000000 20.00000 94.000000 30.800000 0.400 26.0
742 1.000000 109.0 58.000000 18.00000 116.000000 28.500000 0.219 22.0
743 9.000000 140.0 94.000000 29.15342 155.548223 32.700000 0.734 45.0
744 13.000000 153.0 88.000000 37.00000 140.000000 40.600000 1.174 39.0
745 12.000000 100.0 84.000000 33.00000 105.000000 30.000000 0.488 46.0
746 1.000000 147.0 94.000000 41.00000 155.548223 49.300000 0.358 27.0
747 1.000000 81.0 74.000000 41.00000 57.000000 46.300000 1.096 32.0
748 3.000000 187.0 70.000000 22.00000 200.000000 36.400000 0.408 36.0
749 6.000000 162.0 62.000000 29.15342 155.548223 24.300000 0.178 50.0
750 4.000000 136.0 70.000000 29.15342 155.548223 31.200000 1.182 22.0
751 1.000000 121.0 78.000000 39.00000 74.000000 39.000000 0.261 28.0
752 3.000000 108.0 62.000000 24.00000 155.548223 26.000000 0.223 25.0
753 4.494673 181.0 88.000000 44.00000 510.000000 43.300000 0.222 26.0
754 8.000000 154.0 78.000000 32.00000 155.548223 32.400000 0.443 45.0
755 1.000000 128.0 88.000000 39.00000 110.000000 36.500000 1.057 37.0
756 7.000000 137.0 90.000000 41.00000 155.548223 32.000000 0.391 39.0
757 4.494673 123.0 72.000000 29.15342 155.548223 36.300000 0.258 52.0
758 1.000000 106.0 76.000000 29.15342 155.548223 37.500000 0.197 26.0
759 6.000000 190.0 92.000000 29.15342 155.548223 35.500000 0.278 66.0
760 2.000000 88.0 58.000000 26.00000 16.000000 28.400000 0.766 22.0
761 9.000000 170.0 74.000000 31.00000 155.548223 44.000000 0.403 43.0
762 9.000000 89.0 62.000000 29.15342 155.548223 22.500000 0.142 33.0
763 10.000000 101.0 76.000000 48.00000 180.000000 32.900000 0.171 63.0
764 2.000000 122.0 70.000000 27.00000 155.548223 36.800000 0.340 27.0
765 5.000000 121.0 72.000000 23.00000 112.000000 26.200000 0.245 30.0
766 1.000000 126.0 60.000000 29.15342 155.548223 30.100000 0.349 47.0
767 1.000000 93.0 70.000000 31.00000 155.548223 30.400000 0.315 23.0

768 rows × 8 columns

In [ ]: