Data pre-processing in Python, using the Pima Indians diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases

Citation: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Feature Information:

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1) i.e., Diabetes found? (no/yes)

1.0 Load data from CSV


In [1]:
import pandas as pd
from pandas import read_csv
pd.set_option('display.precision', 3) # display floats to 3 decimal places

filename = 'C:/Users/craigrshenton/Desktop/Dropbox/python/python_pro/machine_learning_mastery_with_python/machine_learning_mastery_with_python_code/chapter_07/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = read_csv(filename, names=names)
df.head()


Out[1]:
   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
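
The file has no header row, which is why the column names are passed through `names`. The sketch below illustrates this on an in-memory copy of the five rows shown above. It assumes the raw file is comma-separated; `csv_text` and `sample` are illustrative names, not part of the original notebook.

```python
from io import StringIO

import pandas as pd

# First five rows of the dataset, copied from the output above.
# Assumption: the raw file is comma-separated with no header line.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Without `names`, pandas would treat the first data row as the header.
sample = pd.read_csv(StringIO(csv_text), names=names)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # 'mass' and 'pedi' are floats, the rest integers
```

Checking `shape` and `dtypes` straight after loading is a cheap way to confirm that every column was parsed as a number.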

2.0 Split data into feature (input) and target (output) set


In [2]:
feature_cols = df.columns[0:8]
X = df[feature_cols]    # first 8 cols are features
y = df['class']            # last col is target data
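
As a possible alternative to selecting features by position, the split can be made by name, which keeps working if the column order changes. This is a minimal sketch on a hypothetical toy frame (`toy`, `a`, `b` are made-up names), not the notebook's own data.

```python
import pandas as pd

# Hypothetical toy frame: two feature columns and a 'class' target column.
toy = pd.DataFrame({'a': [1, 2, 3],
                    'b': [0.1, 0.2, 0.3],
                    'class': [0, 1, 0]})

X_toy = toy.drop(columns='class')  # everything except the target
y_toy = toy['class']               # the target column only

print(X_toy.shape, y_toy.shape)
```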

3.0 Rescale

Rescale each feature to the range 0 to 1, so that features measured on different scales become directly comparable


In [3]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
df_scaled = pd.DataFrame(data=rescaledX, columns=feature_cols)
df_scaled.head()


Out[3]:
    preg   plas   pres   skin   test   mass   pedi    age
0  0.353  0.744  0.590  0.354  0.000  0.501  0.234  0.483
1  0.059  0.427  0.541  0.293  0.000  0.396  0.117  0.167
2  0.471  0.920  0.525  0.000  0.000  0.347  0.254  0.183
3  0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.000
4  0.000  0.688  0.328  0.354  0.199  0.642  0.944  0.200
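
For the range 0 to 1, min-max rescaling maps each value x in a column to (x - min) / (max - min), where min and max are taken over that column. The sketch below applies the formula by hand to a small made-up array (`X_toy` is hypothetical) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 4 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0],
                  [5.0, 500.0]])

# Column-wise minimum and maximum.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Manual min-max rescaling to the range 0 to 1.
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
sk = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(manual)
print(np.allclose(manual, sk))
```

Because the minimum and maximum are computed per column, a single extreme value in a feature compresses all the other values of that feature towards one end of the range.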

4.0 Standardisation

Standardise each feature to have a mean of 0 and a standard deviation of 1. This transformation is best suited to features that are approximately normally distributed


In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
standardX = scaler.transform(X)
df_standard = pd.DataFrame(data=standardX, columns=feature_cols)
df_standard.head()


Out[4]:
     preg    plas    pres    skin    test    mass    pedi     age
0   0.640   0.848   0.150   0.907  -0.693   0.204   0.468   1.426
1  -0.845  -1.123  -0.161   0.531  -0.693  -0.684  -0.365  -0.191
2   1.234   1.944  -0.264  -1.288  -0.693  -1.103   0.604  -0.106
3  -0.845  -0.998  -0.161   0.155   0.123  -0.494  -0.921  -1.042
4  -1.142   0.504  -1.505   0.907   0.766   1.410   5.485  -0.020
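
Standardisation replaces each value x in a column with (x - mean) / std, using that column's mean and standard deviation. A minimal sketch on a made-up array (`X_toy` is hypothetical) shows the calculation by hand and checks it against `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0],
                  [5.0, 500.0]])

# Column-wise mean and population standard deviation (ddof=0),
# which is what StandardScaler uses.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# Manual z-score.
manual = (X_toy - mean) / std

# The same transformation using scikit-learn.
sk = StandardScaler().fit_transform(X_toy)

print(manual)
print(np.allclose(manual, sk))
```

Note that pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so a hand calculation done in pandas will differ slightly from `StandardScaler` unless `ddof=0` is passed.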

5.0 Normalisation

Normalise each row (sample) so that its feature vector has a Euclidean (L2) length of 1


In [5]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
df_norm = pd.DataFrame(data=normalizedX, columns=feature_cols)
df_norm.head()


Out[5]:
    preg   plas   pres   skin   test   mass   pedi    age
0  0.034  0.828  0.403  0.196  0.000  0.188  0.004  0.280
1  0.008  0.716  0.556  0.244  0.000  0.224  0.003  0.261
2  0.040  0.924  0.323  0.000  0.000  0.118  0.003  0.162
3  0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139
4  0.000  0.596  0.174  0.152  0.731  0.188  0.010  0.144
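
Unlike the two column-wise transformations above, this one works row by row: each row is divided by its own Euclidean length, which is the default (`norm='l2'`) behaviour of `Normalizer`. Since no statistics from the rest of the dataset are needed, the sketch below can reproduce the calculation with NumPy alone on the first five rows shown in Out[1] (`rows` and `unit_rows` are illustrative names).

```python
import numpy as np

# The eight feature values of the first five rows, copied from Out[1].
rows = np.array([
    [6, 148, 72, 35,   0, 33.6, 0.627, 50],
    [1,  85, 66, 29,   0, 26.6, 0.351, 31],
    [8, 183, 64,  0,   0, 23.3, 0.672, 32],
    [1,  89, 66, 23,  94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
])

# Euclidean (L2) length of each row, kept as a column for broadcasting.
lengths = np.linalg.norm(rows, axis=1, keepdims=True)

# Divide every row by its own length.
unit_rows = rows / lengths

print(np.round(unit_rows, 3))
print(np.linalg.norm(unit_rows, axis=1))  # each row now has length 1
```

The rounded values should agree with the normalised table above. Features with large raw magnitudes, such as `plas` and `test`, dominate each unit vector, so it may be worth rescaling the columns first if that is not the intended effect.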
