In [1]:
### Example taken from: https://machinelearningmastery.com/handle-missing-data-python/
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:
0: Number of times pregnant
1: Plasma glucose concentration
2: Diastolic blood pressure
3: Triceps skinfold thickness
4: 2-Hour serum insulin
5: Body mass index
6: Diabetes pedigree function
7: Age
8: Class variable (0 or 1)
In [2]:
import pandas as pd
dataset = pd.read_csv('pima-indians-diabetes.data.txt', header=None)
dataset.head()
Out[2]:
In [3]:
print(dataset.describe())
This summary is useful: we can see that several columns have a minimum value of zero (0). For some columns, a value of zero does not make sense and indicates an invalid or missing value.
Specifically, the following columns have an invalid zero minimum value:
1: Plasma glucose concentration
2: Diastolic blood pressure
3: Triceps skinfold thickness
4: 2-Hour serum insulin
5: Body mass index
We can get a count of the number of missing values in each of these columns. We can do this by marking all of the values in the subset of the DataFrame we are interested in that are zero as True, and then counting the number of True values in each column.
In [4]:
print((dataset[[1, 2, 3, 4, 5]] == 0).sum())
In Python, specifically with Pandas, NumPy, and scikit-learn, we mark missing values as NaN.
NaN values are ignored by operations such as sum and count.
We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on the subset of columns we are interested in.
After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.
In [6]:
import numpy as np
# mark zero values as missing or NaN
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)
# count the number of NaN values in each column
print(dataset.isnull().sum())
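To see this in action, we can compare len(), count(), and sum() on one of the marked columns, for example the serum insulin column (index 4); a minimal check:
In [ ]:
# NaN-aware behaviour: count() and sum() skip NaN entries
print(len(dataset[4]))     # total number of rows, NaN included
print(dataset[4].count())  # number of non-NaN entries only
print(dataset[4].sum())    # sum computed over the non-NaN entries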
Having missing values in a dataset can cause errors with some machine learning algorithms.
In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.
This is an algorithm that does not work when there are missing values in the dataset.
The example below marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross-validation and print the mean accuracy.
In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# split dataset into inputs and outputs
values = dataset.values  # convert the DataFrame to a NumPy matrix
X = values[:, :8]
y = values[:, 8]
# evaluate an LDA model on the dataset using k-fold cross-validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print('result.mean():', result.mean())
This fails with an error, as we expect: we are prevented from evaluating an LDA algorithm (and many other algorithms) on a dataset with missing values.
Now, we can look at methods to handle the missing values.
The simplest strategy for handling missing data is to remove records that contain a missing value.
We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.
Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:
In [10]:
# drop rows with missing values
dataset.dropna(inplace=True)
# summarize the number of rows and columns in the dataset
print(dataset.shape)
We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.
In [11]:
values = dataset.values  # convert the DataFrame to a NumPy matrix
X = values[:, :8]
y = values[:, 8]
# evaluate an LDA model on the dataset using k-fold cross-validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print('result.mean():', result.mean())
Removing rows with missing values can be too limiting for some predictive modeling problems; an alternative is to impute the missing values.
In [13]:
dataset = pd.read_csv('pima-indians-diabetes.data.txt', header=None)
# mark zero values as missing or NaN
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
# count the number of NaN values in each column
print(dataset.isnull().sum())
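With the missing values imputed by the column means, we can evaluate LDA again, mirroring the earlier cross-validation cells; a minimal sketch (the mean accuracy will generally differ from the row-dropping result):
In [ ]:
# evaluate LDA on the mean-imputed dataset (imports as in the earlier cells)
values = dataset.values
X = values[:, :8]
y = values[:, 8]
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print('result.mean():', result.mean())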
The scikit-learn library provides the Imputer() pre-processing class that can be used to replace missing values.
It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the strategy used to replace it (such as the mean, median, or most frequent value). The Imputer class operates directly on a NumPy array rather than the DataFrame.
The example below uses the Imputer class to replace missing values with the mean of each column, then prints the number of NaN values in the transformed matrix.
In [18]:
from sklearn.preprocessing import Imputer
#values = dataset.values ## generated matrix out of the dataframe
#X = values[:,:8]
#y = values[:,8]
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
transformed_values = imp.fit_transform(dataset)  # returns an imputed NumPy array
# count the number of NaN values in the transformed matrix
print(np.isnan(transformed_values).sum())
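Note that the Imputer class was deprecated and then removed in scikit-learn 0.22; on newer versions the equivalent is SimpleImputer. A minimal sketch, assuming the zeros have already been marked as NaN as above:
In [ ]:
from sklearn.impute import SimpleImputer
# SimpleImputer replaces Imputer in scikit-learn 0.22+
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imp.fit_transform(dataset)
# count the number of NaN values in the transformed matrix
print(np.isnan(transformed_values).sum())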
In [27]:
import pandas as pd
df = pd.read_csv('pima-indians-diabetes.data.txt', header=None)
y = df[8]
X = df.drop(8, axis=1)
# print((X[[1, 2, 3, 4, 5]] == 0).sum())  # count zeros only in the marked columns
print((X == 0).sum())  # count the zeros in every column
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(X)  # fit() only computes the column means; X itself is not changed
print((X == 0).sum())  # zero counts are unchanged: transform() was never called
X.head()
Out[27]:
In [34]:
import pandas as pd
df = pd.read_csv('pima-indians-diabetes.data.txt', header=None)
y = df[8]
X = df.drop(8, axis=1)
print((X == 0).sum())  # count the zeros in each column
imp = Imputer(missing_values=0, strategy='mean', axis=0)
X_imp = imp.fit_transform(X)  # fit_transform() returns an imputed NumPy array
print((X_imp == 0).sum())
print(X_imp[:5])  # X_imp is a NumPy array, so it has no head() method
In [ ]:
'''
https://stackoverflow.com/questions/33660836/impute-entire-dataframe-all-columns-using-scikit-learn-sklearn-without-itera
If you want the mean or median you could do something like:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
imputed_DF = pd.DataFrame(fill_NaN.fit_transform(DF))
imputed_DF.columns = DF.columns
imputed_DF.index = DF.index
If you want to fill them with 0s or something you could always just do:
DF[DF.isnull()] = 0
'''
In [40]:
import pandas as pd
df = pd.read_csv('pima-indians-diabetes.data.txt', header=None)
#values = df.values
#X = values[:,0:8]
#y = values[:,8]
y = df[8]
X = df.drop(8, axis=1)
print(X)
In [42]:
imp = Imputer(missing_values=0, strategy='mean', axis=0)
X_imp = pd.DataFrame(imp.fit_transform(X))  # wrap the imputed array back into a DataFrame
X_imp.columns = X.columns
X_imp.index = X.index
print((X_imp == 0).sum())
X_imp
Out[42]:
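Since X_imp is now a clean DataFrame, it can be fed straight into an algorithm that is sensitive to missing values, such as LDA; a minimal sketch reusing the earlier cross-validation setup:
In [ ]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, X_imp, y, cv=kfold, scoring='accuracy')
print('result.mean():', result.mean())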
In [ ]: