Cardiotocography

Analysis of the UCI Cardiotocography Dataset

Getting the Data


In [1]:
%pylab inline
pylab.style.use('ggplot')
import pandas as pd
import numpy as np
import seaborn as sns


Populating the interactive namespace from numpy and matplotlib

In [2]:
data_df = pd.read_csv('cgt.csv')

In [3]:
data_df.head()


Out[3]:
LB AC FM UC ASTV MSTV ALTV MLTV DL DS ... C D E AD DE LD FS SUSP CLASS NSP
0 120 0 0 0 73 0.5 43 2.4 0 0 ... 0 0 0 0 0 0 1 0 9 2
1 132 4 0 4 17 2.1 0 10.4 2 0 ... 0 0 0 1 0 0 0 0 6 1
2 133 2 0 5 16 2.1 0 13.4 2 0 ... 0 0 0 1 0 0 0 0 6 1
3 134 2 0 6 16 2.4 0 23.0 2 0 ... 0 0 0 1 0 0 0 0 6 1
4 132 4 0 5 16 2.4 0 19.9 0 0 ... 0 0 0 0 0 0 0 0 2 1

5 rows × 34 columns


In [4]:
data_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 34 columns):
LB          2126 non-null int64
AC          2126 non-null int64
FM          2126 non-null int64
UC          2126 non-null int64
ASTV        2126 non-null int64
MSTV        2126 non-null float64
ALTV        2126 non-null int64
MLTV        2126 non-null float64
DL          2126 non-null int64
DS          2126 non-null int64
DP          2126 non-null int64
DR          2126 non-null int64
Width       2126 non-null int64
Min         2126 non-null int64
Max         2126 non-null int64
Nmax        2126 non-null int64
Nzeros      2126 non-null int64
Mode        2126 non-null int64
Mean        2126 non-null int64
Median      2126 non-null int64
Variance    2126 non-null int64
Tendency    2126 non-null int64
A           2126 non-null int64
B           2126 non-null int64
C           2126 non-null int64
D           2126 non-null int64
E           2126 non-null int64
AD          2126 non-null int64
DE          2126 non-null int64
LD          2126 non-null int64
FS          2126 non-null int64
SUSP        2126 non-null int64
CLASS       2126 non-null int64
NSP         2126 non-null int64
dtypes: float64(2), int64(32)
memory usage: 564.8 KB

Summarize the Data


In [5]:
data_df.describe().T


Out[5]:
count mean std min 25% 50% 75% max
LB 2126.0 133.303857 9.840844 106.0 126.0 133.0 140.0 160.0
AC 2126.0 2.722484 3.560850 0.0 0.0 1.0 4.0 26.0
FM 2126.0 7.241298 37.125309 0.0 0.0 0.0 2.0 564.0
UC 2126.0 3.659925 2.847094 0.0 1.0 3.0 5.0 23.0
ASTV 2126.0 46.990122 17.192814 12.0 32.0 49.0 61.0 87.0
MSTV 2126.0 1.332785 0.883241 0.2 0.7 1.2 1.7 7.0
ALTV 2126.0 9.846660 18.396880 0.0 0.0 0.0 11.0 91.0
MLTV 2126.0 8.187629 5.628247 0.0 4.6 7.4 10.8 50.7
DL 2126.0 1.570085 2.499229 0.0 0.0 0.0 3.0 16.0
DS 2126.0 0.003293 0.057300 0.0 0.0 0.0 0.0 1.0
DP 2126.0 0.126058 0.464361 0.0 0.0 0.0 0.0 4.0
DR 2126.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
Width 2126.0 70.445908 38.955693 3.0 37.0 67.5 100.0 180.0
Min 2126.0 93.579492 29.560212 50.0 67.0 93.0 120.0 159.0
Max 2126.0 164.025400 17.944183 122.0 152.0 162.0 174.0 238.0
Nmax 2126.0 4.068203 2.949386 0.0 2.0 3.0 6.0 18.0
Nzeros 2126.0 0.323612 0.706059 0.0 0.0 0.0 0.0 10.0
Mode 2126.0 137.452023 16.381289 60.0 129.0 139.0 148.0 187.0
Mean 2126.0 134.610536 15.593596 73.0 125.0 136.0 145.0 182.0
Median 2126.0 138.090310 14.466589 77.0 129.0 139.0 148.0 186.0
Variance 2126.0 18.808090 28.977636 0.0 2.0 7.0 24.0 269.0
Tendency 2126.0 0.320320 0.610829 -1.0 0.0 0.0 1.0 1.0
A 2126.0 0.180621 0.384794 0.0 0.0 0.0 0.0 1.0
B 2126.0 0.272342 0.445270 0.0 0.0 0.0 1.0 1.0
C 2126.0 0.024929 0.155947 0.0 0.0 0.0 0.0 1.0
D 2126.0 0.038100 0.191482 0.0 0.0 0.0 0.0 1.0
E 2126.0 0.033866 0.180928 0.0 0.0 0.0 0.0 1.0
AD 2126.0 0.156162 0.363094 0.0 0.0 0.0 0.0 1.0
DE 2126.0 0.118532 0.323314 0.0 0.0 0.0 0.0 1.0
LD 2126.0 0.050329 0.218675 0.0 0.0 0.0 0.0 1.0
FS 2126.0 0.032455 0.177248 0.0 0.0 0.0 0.0 1.0
SUSP 2126.0 0.092662 0.290027 0.0 0.0 0.0 0.0 1.0
CLASS 2126.0 4.509878 3.026883 1.0 2.0 4.0 7.0 10.0
NSP 2126.0 1.304327 0.614377 1.0 1.0 1.0 1.0 3.0
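
One detail the summary exposes: DR is zero for every record (its mean, std, min, and max are all 0), so it carries no information for any model. A quick way to confirm which columns are constant (a sketch, not part of the original run):

# Columns with a single unique value carry no signal and could be dropped.
constant_cols = [c for c in data_df.columns if data_df[c].nunique() == 1]
print(constant_cols)  # based on the table above, this should flag DR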

Map the NSP code to the Category Name

The target NSP is the fetal state class code (N = normal, S = suspect, P = pathologic).


In [6]:
nsp_codes = {1: 'normal', 2: 'suspect', 3: 'pathologic'}

# Passing the dict to .map() looks each code up directly.
data_df = data_df.assign(NSP=data_df.NSP.map(nsp_codes))

In [7]:
data_df.head()


Out[7]:
LB AC FM UC ASTV MSTV ALTV MLTV DL DS ... C D E AD DE LD FS SUSP CLASS NSP
0 120 0 0 0 73 0.5 43 2.4 0 0 ... 0 0 0 0 0 0 1 0 9 suspect
1 132 4 0 4 17 2.1 0 10.4 2 0 ... 0 0 0 1 0 0 0 0 6 normal
2 133 2 0 5 16 2.1 0 13.4 2 0 ... 0 0 0 1 0 0 0 0 6 normal
3 134 2 0 6 16 2.4 0 23.0 2 0 ... 0 0 0 1 0 0 0 0 6 normal
4 132 4 0 5 16 2.4 0 19.9 0 0 ... 0 0 0 0 0 0 0 0 2 normal

5 rows × 34 columns


In [8]:
# CLASS is the 10-class FHR pattern code: a second target, not a predictor. Remove it for now.
data_df = data_df.drop('CLASS', axis=1)

Class Imbalance


In [9]:
data_df.NSP.value_counts().plot(kind='bar')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb89e1e9e8>
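
The bar chart makes the skew obvious: normal dominates. For exact proportions, a quick check (not part of the original run):

# Fraction of records in each class; normalize=True converts counts to shares.
data_df.NSP.value_counts(normalize=True)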

Feature Selection by Mutual Information


In [10]:
from sklearn.feature_selection import mutual_info_classif

In [11]:
f_info = mutual_info_classif(data_df.drop('NSP', axis=1), data_df.NSP)
f_info = pd.Series(f_info, index=data_df.columns.drop('NSP').copy())

_, ax = pylab.subplots(1, 1, figsize=(6, 6))
f_info.sort_values(ascending=True).plot(kind='barh', ax=ax)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8a18af28>
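
To read the highest-scoring features off numerically rather than from the chart, one could add the line below (note that mutual_info_classif is stochastic, so exact scores vary between runs unless random_state is set):

# The four largest mutual-information scores.
f_info.nlargest(4)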

Now let's look at how the four highest-scoring features are distributed within each class, and then select them into a smaller model.


In [12]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'SUSP')



In [13]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'MSTV')



In [14]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'ASTV')



In [15]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'ALTV')



In [16]:
top4 = ['SUSP', 'MSTV', 'ASTV', 'ALTV']
smaller_df = data_df.loc[:, top4].assign(NSP=data_df.NSP)

In [17]:
sns.pairplot(smaller_df, hue='NSP')


Out[17]:
<seaborn.axisgrid.PairGrid at 0x2bb8b778898>

Correlations Amongst the Top 4 Features

Gaussian naive Bayes assumes the features are conditionally independent within each class, so it is worth checking how strongly the selected features correlate with one another.


In [18]:
f_corrs = smaller_df.drop('NSP', axis=1).corr()
sns.heatmap(f_corrs, annot=True)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8b3e2da0>

First Model: Gaussian NB with Proportional Priors


In [19]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

In [20]:
model = GaussianNB()

X = smaller_df.drop('NSP', axis=1)
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_prop_prior = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_prop_prior = pd.Series(scores_prop_prior)

scores_prop_prior.plot(kind='bar')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bb9f668>
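
Macro-averaged F1 weights all three classes equally regardless of their frequency. For a per-class breakdown, one could look at cross-validated predictions (a sketch, not part of the original run):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Out-of-fold predictions for every sample, then per-class precision/recall/F1.
y_pred = cross_val_predict(GaussianNB(), X, y, cv=cv)
print(classification_report(y, y_pred))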

Second Model: Gaussian NB with Equal Priors


In [21]:
model = GaussianNB(priors=[1/3, 1/3, 1/3])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_eq_prior = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_eq_prior = pd.Series(scores_eq_prior)

scores_eq_prior.plot(kind='bar')


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bc44940>

In [22]:
scores = pd.DataFrame({'proportional_priors': scores_prop_prior, 'equal_priors': scores_eq_prior})
scores.plot(kind='bar')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bca3cc0>

In [23]:
scores.mean()


Out[23]:
equal_priors           0.687109
proportional_priors    0.689554
dtype: float64
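
The two prior settings score almost identically on average. Because both runs shared the same folds, the per-fold scores are paired, and a paired test is the appropriate comparison; a sketch, assuming scipy is available:

from scipy import stats

# Paired t-test over the ten matched fold scores; a large p-value means
# no evidence that one prior choice outperforms the other.
stats.ttest_rel(scores_prop_prior, scores_eq_prior)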

Include More Features


In [24]:
features_to_use = f_info[f_info > 0.10].index

In [25]:
features_to_use


Out[25]:
Index(['LB', 'AC', 'ASTV', 'MSTV', 'ALTV', 'Width', 'Min', 'Mode', 'Mean',
       'Median', 'Variance', 'LD', 'SUSP'],
      dtype='object')
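
Thresholding the mutual-information scores at 0.10 keeps these 13 features. If one wanted the same selection as a reusable pipeline step, scikit-learn's SelectKBest offers an equivalent idiom (a sketch, not part of the original run):

from sklearn.feature_selection import SelectKBest

# Keep the 13 features with the highest mutual-information scores.
selector = SelectKBest(mutual_info_classif, k=13)
X_top13 = selector.fit_transform(data_df.drop('NSP', axis=1), data_df.NSP)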

In [26]:
model = GaussianNB()

X = data_df.loc[:, features_to_use]
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_with_more_features = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_with_more_features = pd.Series(scores_with_more_features)
scores_with_more_features.plot(kind='bar')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bb90a90>

In [27]:
len(features_to_use)


Out[27]:
13

In [28]:
scores_cmp = pd.DataFrame({'f_4': scores_prop_prior, 'f_13': scores_with_more_features})

In [29]:
scores_cmp.plot(kind='bar')


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8d5f8e48>

Switch to a Random Forest


In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
model = RandomForestClassifier(
    n_estimators=10,
    max_depth=4,
    max_features=5)

X = data_df.loc[:, features_to_use]
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_with_rf_13 = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_with_rf_13 = pd.Series(scores_with_rf_13)
scores_with_rf_13.plot(kind='bar')


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8d7dcc50>

In [32]:
scores_with_rf_13.mean()


Out[32]:
0.8959406630267732
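
A mean macro-F1 near 0.90 clearly beats both naive Bayes variants. To see which inputs drive the forest, one can fit it once on the full data and inspect the impurity-based importances (a sketch, not part of the original run):

# Impurity-based feature importances from a single fit; they sum to 1.
rf = model.fit(X, y)
pd.Series(rf.feature_importances_, index=X.columns).sort_values().plot(kind='barh')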

Grid Search for the Best RF Parameters


In [33]:
from sklearn.model_selection import GridSearchCV

In [34]:
model = RandomForestClassifier()

param_grid = {
    'n_estimators': [10, 20],
    'max_depth': [4, 8],
    'max_features': [4, 5, 13],
}

X = data_df.loc[:, features_to_use]
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)

grid_search = GridSearchCV(
    estimator=model, 
    param_grid=param_grid, 
    scoring='f1_macro', 
    verbose=1,
    cv=cv)

grid_search = grid_search.fit(X, y)


Fitting 10 folds for each of 12 candidates, totalling 120 fits
[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:   10.0s finished

In [35]:
grid_search.best_params_


Out[35]:
{'max_depth': 8, 'max_features': 13, 'n_estimators': 20}

In [36]:
grid_search.best_score_


Out[36]:
0.92873059847296091
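
The winner sits at the largest value of every parameter in the grid, which hints that widening the grid might help. The full table of results is available as a DataFrame (a sketch, not part of the original run):

# One row per parameter combination, with mean/std test scores and ranks.
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]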

Check for Overfitting with the Best Parameters


In [37]:
grid_search.cv_results_['rank_test_score']


Out[37]:
array([12, 11,  9, 10,  8,  7,  5,  3,  4,  2,  6,  1])

In [38]:
# Train and test scores on the last fold (split 9) for all 12 candidates,
# not just the best one; the names reflect that.
split9_test_score = grid_search.cv_results_['split9_test_score']
split9_train_score = grid_search.cv_results_['split9_train_score']

split9_scores = pd.DataFrame({'train': split9_train_score, 'test': split9_test_score})
split9_scores.plot(kind='bar')


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8d886320>
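
A single split can be noisy; averaging the train and test scores over all ten folds for each of the twelve candidates gives a steadier read on overfitting. A sketch (note that newer scikit-learn versions only populate train scores when GridSearchCV is created with return_train_score=True):

# Mean train vs. mean test score per candidate; a wide gap signals overfitting.
mean_scores = pd.DataFrame({
    'mean_train': grid_search.cv_results_['mean_train_score'],
    'mean_test': grid_search.cv_results_['mean_test_score'],
})
mean_scores.plot(kind='bar')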