Cardiotocography

Analysis of the UCI Cardiotocography Dataset

Getting the Data


In [1]:
%pylab inline
pylab.style.use('ggplot')
import pandas as pd
import numpy as np
import seaborn as sns


Populating the interactive namespace from numpy and matplotlib

In [2]:
data_df = pd.read_csv('cgt.csv')

In [3]:
data_df.head()


Out[3]:
LB AC FM UC ASTV MSTV ALTV MLTV DL DS ... C D E AD DE LD FS SUSP CLASS NSP
0 120 0 0 0 73 0.5 43 2.4 0 0 ... 0 0 0 0 0 0 1 0 9 2
1 132 4 0 4 17 2.1 0 10.4 2 0 ... 0 0 0 1 0 0 0 0 6 1
2 133 2 0 5 16 2.1 0 13.4 2 0 ... 0 0 0 1 0 0 0 0 6 1
3 134 2 0 6 16 2.4 0 23.0 2 0 ... 0 0 0 1 0 0 0 0 6 1
4 132 4 0 5 16 2.4 0 19.9 0 0 ... 0 0 0 0 0 0 0 0 2 1

5 rows × 34 columns


In [4]:
data_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 34 columns):
LB          2126 non-null int64
AC          2126 non-null int64
FM          2126 non-null int64
UC          2126 non-null int64
ASTV        2126 non-null int64
MSTV        2126 non-null float64
ALTV        2126 non-null int64
MLTV        2126 non-null float64
DL          2126 non-null int64
DS          2126 non-null int64
DP          2126 non-null int64
DR          2126 non-null int64
Width       2126 non-null int64
Min         2126 non-null int64
Max         2126 non-null int64
Nmax        2126 non-null int64
Nzeros      2126 non-null int64
Mode        2126 non-null int64
Mean        2126 non-null int64
Median      2126 non-null int64
Variance    2126 non-null int64
Tendency    2126 non-null int64
A           2126 non-null int64
B           2126 non-null int64
C           2126 non-null int64
D           2126 non-null int64
E           2126 non-null int64
AD          2126 non-null int64
DE          2126 non-null int64
LD          2126 non-null int64
FS          2126 non-null int64
SUSP        2126 non-null int64
CLASS       2126 non-null int64
NSP         2126 non-null int64
dtypes: float64(2), int64(32)
memory usage: 564.8 KB

Summarize the Data


In [5]:
data_df.describe().T


Out[5]:
count mean std min 25% 50% 75% max
LB 2126.0 133.303857 9.840844 106.0 126.0 133.0 140.0 160.0
AC 2126.0 2.722484 3.560850 0.0 0.0 1.0 4.0 26.0
FM 2126.0 7.241298 37.125309 0.0 0.0 0.0 2.0 564.0
UC 2126.0 3.659925 2.847094 0.0 1.0 3.0 5.0 23.0
ASTV 2126.0 46.990122 17.192814 12.0 32.0 49.0 61.0 87.0
MSTV 2126.0 1.332785 0.883241 0.2 0.7 1.2 1.7 7.0
ALTV 2126.0 9.846660 18.396880 0.0 0.0 0.0 11.0 91.0
MLTV 2126.0 8.187629 5.628247 0.0 4.6 7.4 10.8 50.7
DL 2126.0 1.570085 2.499229 0.0 0.0 0.0 3.0 16.0
DS 2126.0 0.003293 0.057300 0.0 0.0 0.0 0.0 1.0
DP 2126.0 0.126058 0.464361 0.0 0.0 0.0 0.0 4.0
DR 2126.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
Width 2126.0 70.445908 38.955693 3.0 37.0 67.5 100.0 180.0
Min 2126.0 93.579492 29.560212 50.0 67.0 93.0 120.0 159.0
Max 2126.0 164.025400 17.944183 122.0 152.0 162.0 174.0 238.0
Nmax 2126.0 4.068203 2.949386 0.0 2.0 3.0 6.0 18.0
Nzeros 2126.0 0.323612 0.706059 0.0 0.0 0.0 0.0 10.0
Mode 2126.0 137.452023 16.381289 60.0 129.0 139.0 148.0 187.0
Mean 2126.0 134.610536 15.593596 73.0 125.0 136.0 145.0 182.0
Median 2126.0 138.090310 14.466589 77.0 129.0 139.0 148.0 186.0
Variance 2126.0 18.808090 28.977636 0.0 2.0 7.0 24.0 269.0
Tendency 2126.0 0.320320 0.610829 -1.0 0.0 0.0 1.0 1.0
A 2126.0 0.180621 0.384794 0.0 0.0 0.0 0.0 1.0
B 2126.0 0.272342 0.445270 0.0 0.0 0.0 1.0 1.0
C 2126.0 0.024929 0.155947 0.0 0.0 0.0 0.0 1.0
D 2126.0 0.038100 0.191482 0.0 0.0 0.0 0.0 1.0
E 2126.0 0.033866 0.180928 0.0 0.0 0.0 0.0 1.0
AD 2126.0 0.156162 0.363094 0.0 0.0 0.0 0.0 1.0
DE 2126.0 0.118532 0.323314 0.0 0.0 0.0 0.0 1.0
LD 2126.0 0.050329 0.218675 0.0 0.0 0.0 0.0 1.0
FS 2126.0 0.032455 0.177248 0.0 0.0 0.0 0.0 1.0
SUSP 2126.0 0.092662 0.290027 0.0 0.0 0.0 0.0 1.0
CLASS 2126.0 4.509878 3.026883 1.0 2.0 4.0 7.0 10.0
NSP 2126.0 1.304327 0.614377 1.0 1.0 1.0 1.0 3.0
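
One detail the summary exposes: DR is zero for every record (its mean, std, min, and max are all 0), so it carries no information for any model. A quick way to confirm which columns are constant (a sketch, not part of the original run):

# Columns with a single unique value carry no signal and could be dropped.
constant_cols = [c for c in data_df.columns if data_df[c].nunique() == 1]
print(constant_cols)  # based on the table above, this should flag DR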

Map the NSP code to the Category Name

The target NSP is the fetal state class code (N = normal, S = suspect, P = pathologic).


In [6]:
nsp_codes = {1: 'normal', 2: 'suspect', 3: 'pathologic'}

# Passing the dict to .map() looks each code up directly.
data_df = data_df.assign(NSP=data_df.NSP.map(nsp_codes))

In [7]:
data_df.head()


Out[7]:
LB AC FM UC ASTV MSTV ALTV MLTV DL DS ... C D E AD DE LD FS SUSP CLASS NSP
0 120 0 0 0 73 0.5 43 2.4 0 0 ... 0 0 0 0 0 0 1 0 9 suspect
1 132 4 0 4 17 2.1 0 10.4 2 0 ... 0 0 0 1 0 0 0 0 6 normal
2 133 2 0 5 16 2.1 0 13.4 2 0 ... 0 0 0 1 0 0 0 0 6 normal
3 134 2 0 6 16 2.4 0 23.0 2 0 ... 0 0 0 1 0 0 0 0 6 normal
4 132 4 0 5 16 2.4 0 19.9 0 0 ... 0 0 0 0 0 0 0 0 2 normal

5 rows × 34 columns


In [8]:
# CLASS is the 10-class FHR pattern code: a second target, not a predictor. Remove it for now.
data_df = data_df.drop('CLASS', axis=1)

Class Imbalance


In [9]:
data_df.NSP.value_counts().plot(kind='bar')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb89e1e9e8>
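
The bar chart makes the skew obvious: normal dominates. For exact proportions, a quick check (not part of the original run):

# Fraction of records in each class; normalize=True converts counts to shares.
data_df.NSP.value_counts(normalize=True)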

Feature Selection by Mutual Information


In [10]:
from sklearn.feature_selection import mutual_info_classif

In [11]:
f_info = mutual_info_classif(data_df.drop('NSP', axis=1), data_df.NSP)
f_info = pd.Series(f_info, index=data_df.columns.drop('NSP').copy())

_, ax = pylab.subplots(1, 1, figsize=(6, 6))
f_info.sort_values(ascending=True).plot(kind='barh', ax=ax)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8a18af28>
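
To read the highest-scoring features off numerically rather than from the chart, one could add the line below (note that mutual_info_classif is stochastic, so exact scores vary between runs unless random_state is set):

# The four largest mutual-information scores.
f_info.nlargest(4)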

Now let's look at how the four highest-scoring features are distributed within each class, and then select them into a smaller model.


In [12]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'SUSP')



In [13]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'MSTV')



In [14]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'ASTV')



In [15]:
fg = sns.FacetGrid(col='NSP', data=data_df)
fg = fg.map(pylab.hist, 'ALTV')



In [16]:
top4 = ['SUSP', 'MSTV', 'ASTV', 'ALTV']
smaller_df = data_df.loc[:, top4].assign(NSP=data_df.NSP)

In [17]:
sns.pairplot(smaller_df, hue='NSP')


Out[17]:
<seaborn.axisgrid.PairGrid at 0x2bb8b778898>

Correlations Amongst the Top 4 Features

Gaussian naive Bayes assumes the features are conditionally independent within each class, so it is worth checking how strongly the selected features correlate with one another.


In [18]:
f_corrs = smaller_df.drop('NSP', axis=1).corr()
sns.heatmap(f_corrs, annot=True)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8b3e2da0>

First Model: Gaussian NB with Proportional Priors


In [19]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

In [20]:
model = GaussianNB()

X = smaller_df.drop('NSP', axis=1)
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_prop_prior = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_prop_prior = pd.Series(scores_prop_prior)

scores_prop_prior.plot(kind='bar')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bb9f668>
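
Macro-averaged F1 weights all three classes equally regardless of their frequency. For a per-class breakdown, one could look at cross-validated predictions (a sketch, not part of the original run):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Out-of-fold predictions for every sample, then per-class precision/recall/F1.
y_pred = cross_val_predict(GaussianNB(), X, y, cv=cv)
print(classification_report(y, y_pred))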

Second Model: Gaussian NB with Equal Priors


In [21]:
model = GaussianNB(priors=[1/3, 1/3, 1/3])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_eq_prior = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_eq_prior = pd.Series(scores_eq_prior)

scores_eq_prior.plot(kind='bar')


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bc44940>

In [22]:
scores = pd.DataFrame({'proportional_priors': scores_prop_prior, 'equal_priors': scores_eq_prior})
scores.plot(kind='bar')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bca3cc0>

In [23]:
scores.mean()


Out[23]:
equal_priors           0.687109
proportional_priors    0.689554
dtype: float64
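
The two prior settings score almost identically on average. Because both runs shared the same folds, the per-fold scores are paired, and a paired test is the appropriate comparison; a sketch, assuming scipy is available:

from scipy import stats

# Paired t-test over the ten matched fold scores; a large p-value means
# no evidence that one prior choice outperforms the other.
stats.ttest_rel(scores_prop_prior, scores_eq_prior)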

Include More Features


In [24]:
features_to_use = f_info[f_info > 0.10].index

In [25]:
features_to_use


Out[25]:
Index(['LB', 'AC', 'ASTV', 'MSTV', 'ALTV', 'Width', 'Min', 'Mode', 'Mean',
       'Median', 'Variance', 'LD', 'SUSP'],
      dtype='object')
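
Thresholding the mutual-information scores at 0.10 keeps these 13 features. If one wanted the same selection as a reusable pipeline step, scikit-learn's SelectKBest offers an equivalent idiom (a sketch, not part of the original run):

from sklearn.feature_selection import SelectKBest

# Keep the 13 features with the highest mutual-information scores.
selector = SelectKBest(mutual_info_classif, k=13)
X_top13 = selector.fit_transform(data_df.drop('NSP', axis=1), data_df.NSP)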

In [26]:
model = GaussianNB()

X = data_df.loc[:, features_to_use]
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_with_more_features = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_with_more_features = pd.Series(scores_with_more_features)
scores_with_more_features.plot(kind='bar')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8bb90a90>

In [27]:
len(features_to_use)


Out[27]:
13

In [28]:
scores_cmp = pd.DataFrame({'f_4': scores_prop_prior, 'f_13': scores_with_more_features})

In [29]:
scores_cmp.plot(kind='bar')


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8d5f8e48>

Switch to a Random Forest


In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
model = RandomForestClassifier(
    n_estimators=10,
    max_depth=4,
    max_features=5)

X = data_df.loc[:, features_to_use]
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
scores_with_rf_13 = cross_val_score(model, X=X, y=y, cv=cv, scoring='f1_macro')
scores_with_rf_13 = pd.Series(scores_with_rf_13)
scores_with_rf_13.plot(kind='bar')


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8d7dcc50>

In [32]:
scores_with_rf_13.mean()


Out[32]:
0.8959406630267732
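
A mean macro-F1 near 0.90 clearly beats both naive Bayes variants. To see which inputs drive the forest, one can fit it once on the full data and inspect the impurity-based importances (a sketch, not part of the original run):

# Impurity-based feature importances from a single fit; they sum to 1.
rf = model.fit(X, y)
pd.Series(rf.feature_importances_, index=X.columns).sort_values().plot(kind='barh')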

Grid Search for the Best RF Parameters


In [33]:
from sklearn.model_selection import GridSearchCV

In [34]:
model = RandomForestClassifier()

param_grid = {
    'n_estimators': [10, 20],
    'max_depth': [4, 8],
    'max_features': [4, 5, 13],
}

X = data_df.loc[:, features_to_use]
y = data_df.NSP

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)

grid_search = GridSearchCV(
    estimator=model, 
    param_grid=param_grid, 
    scoring='f1_macro', 
    verbose=1,
    cv=cv)

grid_search = grid_search.fit(X, y)


Fitting 10 folds for each of 12 candidates, totalling 120 fits
[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:   10.0s finished

In [35]:
grid_search.best_params_


Out[35]:
{'max_depth': 8, 'max_features': 13, 'n_estimators': 20}

In [36]:
grid_search.best_score_


Out[36]:
0.92873059847296091
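
The winner sits at the largest value of every parameter in the grid, which hints that widening the grid might help. The full table of results is available as a DataFrame (a sketch, not part of the original run):

# One row per parameter combination, with mean/std test scores and ranks.
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]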

Check for Overfitting with the Best Parameters


In [37]:
grid_search.cv_results_['rank_test_score']


Out[37]:
array([12, 11,  9, 10,  8,  7,  5,  3,  4,  2,  6,  1])

In [38]:
# Train and test scores on the last fold (split 9) for all 12 candidates,
# not just the best one; the names reflect that.
split9_test_score = grid_search.cv_results_['split9_test_score']
split9_train_score = grid_search.cv_results_['split9_train_score']

split9_scores = pd.DataFrame({'train': split9_train_score, 'test': split9_test_score})
split9_scores.plot(kind='bar')


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bb8d886320>
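
A single split can be noisy; averaging the train and test scores over all ten folds for each of the twelve candidates gives a steadier read on overfitting. A sketch (note that newer scikit-learn versions only populate train scores when GridSearchCV is created with return_train_score=True):

# Mean train vs. mean test score per candidate; a wide gap signals overfitting.
mean_scores = pd.DataFrame({
    'mean_train': grid_search.cv_results_['mean_train_score'],
    'mean_test': grid_search.cv_results_['mean_test_score'],
})
mean_scores.plot(kind='bar')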