Perhaps one of the trendiest topics in the world right now is machine learning and big data, which have become recurring topics on Main Street. The application of these in different fields including branches of knowledge that range from astronomy to manufacturing, but perhaps the field of medicine is one the most interesting of them all. Current advances in this field include disease diagnosis, image analysis, new drug developments and pandemic forecasts among others. These progress poses new challenges and opportunities to improve the quality of treatments and the life expectancy of human beings.
Being able to participate in something that contributes to human well-being motivated me to research and apply machine learning in an area of medicine to the capstone project.
From personal experience, a recurring problem that affects people at certain ages is pneumonia, which in some cases is usually fatal especially when it refers to children or older adults. Knowing this and the availability of records in Intensive Care Unit MIMIC-III database, I decided to select pneumonia deaths as a theme to develop the capstone project, that in addition, could be useful for the prognosis of deaths in Intensive Care Units, and with further investigations be the outset for development of tools that doctors and nurses could use to improve their work.
The hypothesis is that from microbiological variables coming from tests it’s possible predict whether a patient can or not die by this disease. This leads us to a problem that can be represented binarily, and that can be related to physicochemical variables obtained from microbiological tests. These relationships allow the modeling of the pneumonia death problem by means of a supervised learning model such as the Support Vector Machines, Decision Trees, Logistic Regression and Ada boost ensemble.
In [1]:
import numpy as np
import pandas as pd
import datetime
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import psycopg2
import time
import itertools
from pandas.io.sql import read_sql
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier
from sklearn import metrics
The MIMIC III database consists of a collection of csv files, which can be imported into PostgreSQL. Once imported to PostgreSQL, is possible to use libraries from python to analyze the different data contained in it, make the necessary transformations to implement the desired forecast model. The input variables will be defined from different tables and seeking relate the independent binary variable (life or death) of subjects with age, sex, results of microbiological events (test results) and severity of these results.
Before you can download the data, you must complete the CITI "Data or Specimens Only Research" course . Once you accomplish the course you can download the data from https://physionet.org/works/MIMICIIIClinicalDatabase/. Then, the first step was to understand the structure of the database. This consists of a collection of 26 csv files, which can be imported into PostgreSQL. These files contain medical, economic, demographic and death of patients admitted information for several years at the ICU of Beth Israel Deaconess Medical Center, as the data is sensitive, some records were changed like date of admittance and date of birth, in order to avoid the identification of the patients from these records, and this information will be misused in the future.
In [2]:
conn=psycopg2.connect(
dbname='mimic',
user='postgres',
host='localhost',
port=5432,
password= 123
)
cur = conn.cursor()
process = time.process_time()
print (process)
The first step was creating four tables to facilitate consult the required data for the analysis, these tables are:
The final result of these process is a sql query which filter and transform the data in a matrix with this columns fields: hospital_expire_flag, subject_id, gender, last_admit_age, age_group, category, label, valuenum_avg, org_name, ab_name, interpretation, and 2,430,640 records to 829 patients with some diagnosis related with pneumonia panda data frame
In [3]:
sql_sl = " SELECT hospital_expire_flag, subject_id, gender, last_admit_age, age_group, category, \
label, valuenum_avg, org_name,ab_name, interpretation \
FROM mimiciii.pneumonia;"
patients_pn= read_sql(sql_sl, conn, coerce_float=True, params=None)
print (patients_pn.info())
process = time.process_time()
print (process)
For each subject, I categorized the population in five types of groups according to the age recorded at the time of admission to the ICU, which are neonates [0,1], middle (1, 14), adults (14, 65), Older adults [65, 85] and older elderly people (85, 91.4].
In [4]:
patients_pn.head()
Out[4]:
In the Website they explain that the average age of these patients is 91.4 years, reason why I decided that if I want have some consistent data from this segment of the population I should replace it at least for its average value
In [5]:
row_index = patients_pn.last_admit_age >= 300
patients_pn.loc[row_index , 'last_admit_age' ] = 91.4 #https://mimic.physionet.org/mimictables/patients/
In [6]:
patients_pn['category'].unique()
Out[6]:
In [7]:
patients_pn['category'].value_counts()
Out[7]:
In [8]:
patients_category = patients_pn['category'].value_counts().reset_index()
patients_category.columns=['category','Count']
patients_category['Count'].apply('{:,.2f}'.format)
patients_category['cum_perc'] = 100*patients_category.Count/patients_category.Count.sum()
patients_category['cum_perc'] = patients_category['cum_perc'].map('{:,.4f}%'.format)
print (patients_category)
In [9]:
patients_pn['category'].value_counts().plot(kind='bar')
Out[9]:
In [10]:
patients_pn['label'].unique()
Out[10]:
In [11]:
patients_pn['label'].value_counts()
Out[11]:
In [12]:
patients_label = patients_pn['label'].value_counts().reset_index()
patients_label.columns=['label','Count']
patients_label['Count'].apply('{:,.2f}'.format)
patients_label['cum_perc'] = 100*patients_label.Count/patients_label.Count.sum()
patients_label['cum_perc'] = patients_label['cum_perc'].map('{:,.4f}%'.format)
print (patients_label)
In [13]:
patients_pn['label'].value_counts().plot(kind='bar')
Out[13]:
In [14]:
patients_pn['ab_name'].unique()
Out[14]:
In [15]:
patients_pn['ab_name'].value_counts()
Out[15]:
In [16]:
patients_ab_name = patients_pn['ab_name'].value_counts().reset_index()
patients_ab_name.columns=['ab_name','Count']
patients_ab_name['Count'].apply('{:,.2f}'.format)
patients_ab_name['cum_perc'] = 100*patients_ab_name.Count/patients_ab_name.Count.sum()
patients_ab_name['cum_perc'] = patients_ab_name ['cum_perc'].map('{:,.4f}%'.format)
print (patients_ab_name)
In [17]:
patients_pn['ab_name'].value_counts().plot(kind='bar')
Out[17]:
In [18]:
patients_pn['org_name'].unique()
Out[18]:
In [19]:
patients_pn['org_name'].value_counts()
Out[19]:
In [20]:
patients_pn['org_name'].value_counts().plot(kind='bar')
Out[20]:
In [21]:
patients_org_name = patients_pn['org_name'].value_counts().reset_index()
patients_org_name.columns=['org_name','Count']
patients_org_name['Count'].apply('{:,.2f}'.format)
patients_org_name['cum_perc'] = 100*patients_org_name.Count/patients_org_name.Count.sum()
patients_org_name['cum_perc'] = patients_org_name['cum_perc'].map('{:,.4f}%'.format)
print (patients_org_name)
In [22]:
patients_pn['interpretation'].unique()
Out[22]:
In [23]:
patients_pn['interpretation'].value_counts()
Out[23]:
In [24]:
patients_interpretation = patients_pn['interpretation'].value_counts().reset_index()
patients_interpretation.columns=['interpretation','Count']
patients_interpretation['Count'].apply('{:,.2f}'.format)
patients_interpretation['cum_perc'] = 100*patients_interpretation.Count/patients_interpretation.Count.sum()
patients_interpretation['cum_perc'] = patients_interpretation ['cum_perc'].map('{:,.4f}%'.format)
print (patients_interpretation)
In [25]:
patients_pn['interpretation'].value_counts().plot(kind='bar')
Out[25]:
In [26]:
patients_pn.head()
Out[26]:
To transform the matrix in a pivot table, the first step is transform some categorical variables as dummy variables. The chosen variables are gender, age_group, category, label, org_name, ab_name, and interpretation. This operation was done with pandas get_dummies command. The result of this transformation is a panda data frame with shape 2,430,640 rows by 716 columns, these new columns are binaries variables and only take the number 1 once the categorical effect happened.
In [27]:
patients_dummy = pd.get_dummies(patients_pn,prefix=['gender', 'age_group', 'category','label',
'org_name','ab_name', 'interpretation'])
patients_dummy.head()
Out[27]:
The next step, is to transform the matrix into a PivotTable, the purpose of this transformation is to be able to have the medical data in individual lines per subject and numerical form.
To do that, I employed pandas pivot_table command, and using as indexes subject_id and hospital_expire_flag. With this transformation, the resulting panda data frame has 829 rows by 724 columns. The data transformed in this form allow apply the classifier models to this data.
In [28]:
patients_data = pd.pivot_table(patients_dummy,index=["subject_id", "hospital_expire_flag" ])
process = time.process_time()
print (process)
In [29]:
patients_data.head()
Out[29]:
In [30]:
patients_data.info()
In [31]:
patients = patients_data.reset_index()
In [32]:
patients.head()
Out[32]:
In [33]:
p_data= patients.ix[:,2:]
In [34]:
p_data.head()
Out[34]:
In [35]:
p_target= patients['hospital_expire_flag']
In [36]:
p_target.head()
Out[36]:
In all models, the variable dependent is survival state (Alive / Deceased). In order to sub-setting the data I work with a test size of 25% of the sample, I chose this value after some essays, a higher percentage could lower the computer velocity, and a higher value could make the results will be spurious or meaningless.
In [37]:
X_train, X_test, y_train, y_test = train_test_split(
p_data, p_target, test_size=0.25, random_state=123)
In [38]:
X_train.head()
Out[38]:
In [39]:
X_train.shape, y_train.shape
Out[39]:
In [40]:
X_test.shape, y_test.shape
Out[40]:
In [41]:
# the same model as above, only has changed the jobs
clf_SVC = SVC(kernel='linear', C=1)
scores_SVC = cross_val_score(clf_SVC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_SVC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_SVC.mean(), scores_SVC.std() * 2))
clf_SVC = clf_SVC.fit(X_train, y_train)
y_predicted_SVC = clf_SVC.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_SVC))
process = time.process_time()
print (process)
In [42]:
sns.heatmap(confusion_matrix(y_test, y_predicted_SVC), annot = True, fmt = '', cmap = "GnBu")
Out[42]:
In [43]:
print ("Fitting the Support Vector Classification - kernel Radial Basis Function classifier to the training set")
param_grid = {
'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5],
'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_RBF = GridSearchCV(SVC(kernel='rbf', class_weight='balanced', cache_size=1000), param_grid)
clf_RBF = clf_RBF.fit(X_train, y_train)
scores_RBF = cross_val_score(clf_RBF, X_train, y_train, cv=4, n_jobs=-1)
print(scores_RBF)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_RBF.mean(), scores_RBF.std() * 2))
print ("Best estimator found by grid search:")
print (clf_RBF.best_estimator_)
y_predicted_RBF = clf_RBF.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_RBF))
process = time.process_time()
print (process)
In [44]:
sns.heatmap(confusion_matrix(y_test, y_predicted_RBF), annot = True, fmt = '', cmap = "GnBu")
Out[44]:
In [45]:
# Mimic-iii_Model-Pulmonary.ipynb
print ("Fitting the Linear Support Vector Classification - Hingue loss classifier to the training set")
param_grid = {
'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5],
#'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_LSCV = GridSearchCV(LinearSVC(C=1, loss= 'hinge'), param_grid, n_jobs=-1)
clf_LSCV = clf_LSCV.fit(X_train, y_train)
scores_LSCV = cross_val_score(clf_LSCV, X_train, y_train, cv=4, n_jobs=-1)
print(scores_LSCV)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_LSCV.mean(), scores_LSCV.std() * 2))
print ("Best estimator found by grid search:")
print (clf_LSCV.best_estimator_)
y_predicted_LSCV = clf_LSCV.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_LSCV))
process = time.process_time()
print (process)
In [46]:
sns.heatmap(confusion_matrix(y_test, y_predicted_LSCV), annot = True, fmt = '', cmap = "GnBu")
Out[46]:
print ("Fitting the classifier to the training set") from sklearn.tree import DecisionTreeClassifier param_grid = { 'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'degree': [3,4,5] }
clf_poly = GridSearchCV(SVC(kernel='poly', class_weight='balanced'), param_grid, n_jobs=-1) clf_poly = clf_poly.fit(X_train, y_train)
scores_poly = cross_val_score(clf_poly, X_train, y_train, cv=4,n_jobs=-1) print(scores_poly) print("Accuracy: %0.4f (+/- %0.4f)" % (scores_poly.mean(), scores_poly.std() * 2))
print ("Best estimator found by grid search:") print (clf_poly.bestestimator)
y_predicted_poly = clf_poly.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_poly))
process = time.process_time() print (process)
In [47]:
print ("Fitting the Support Vector Classification - kernel Sigmoid classifier to the training set")
param_grid = {
'C': [1e3, 1e4, 1e5],
'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
'coef0':[-1,0,1]
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_SIGMOID = GridSearchCV(SVC(kernel='sigmoid', class_weight='balanced'), param_grid, n_jobs=-1)
clf_SIGMOID = clf_SIGMOID.fit(X_train, y_train)
scores_SIGMOID = cross_val_score(clf_SIGMOID, X_train, y_train, cv=4, n_jobs=-1)
print(scores_SIGMOID)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_SIGMOID.mean(), scores_SIGMOID.std() * 2))
print ("Best estimator found by grid search:")
print (clf_SIGMOID.best_estimator_)
y_predicted_SIGMOID = clf_SIGMOID.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_SIGMOID))
process = time.process_time()
print (process)
In [48]:
sns.heatmap(confusion_matrix(y_test, y_predicted_SIGMOID), annot = True, fmt = '', cmap = "GnBu")
Out[48]:
In [49]:
print ("Fitting the Decision Tree Classifier to the training set")
param_grid = {
'max_depth': [2, 3, 4, 5, 6,7],
#'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_DTC = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=123,
class_weight='balanced'), param_grid, n_jobs=-1)
clf_DTC = clf_DTC.fit(X_train, y_train)
scores_DTC = cross_val_score(clf_DTC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_DTC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_DTC.mean(), scores_DTC.std() * 2))
print ("Best estimator found by grid search:")
print (clf_DTC.best_estimator_)
y_predicted_DTC = clf_DTC.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_DTC))
process = time.process_time()
print (process)
In [50]:
sns.heatmap(confusion_matrix(y_test, y_predicted_DTC), annot = True, fmt = '', cmap = "GnBu")
Out[50]:
In [51]:
print ("Fitting the Random Forest Classifier to the training set")
param_grid = {
'n_estimators' :[3,5,7,10],
'max_depth': [2, 3, 4, 5, 6,7],
#'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_RFC = GridSearchCV(RandomForestClassifier(min_samples_split=2, random_state=123, class_weight='balanced'),
param_grid, n_jobs=-1)
clf_RFC = clf_RFC.fit(X_train, y_train)
scores_RFC = cross_val_score(clf_RFC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_RFC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_RFC.mean(), scores_RFC.std() * 2))
print ("Best estimator found by grid search:")
print (clf_RFC.best_estimator_)
y_predicted_RFC = clf_RFC.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_RFC))
process = time.process_time()
print (process)
In [52]:
sns.heatmap(confusion_matrix(y_test, y_predicted_RFC), annot = True, fmt = '', cmap = "GnBu")
Out[52]:
In [53]:
print ("Fitting the Extremely Tree Classifier to the training set")
param_grid = {
'n_estimators' :[3,5,10],
'max_depth': [2, 3, 4, 5, 6,7],
#'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_EFC = GridSearchCV(ExtraTreesClassifier(min_samples_split=2, random_state=123, class_weight='balanced'),
param_grid, n_jobs=-1)
clf_EFC = clf_EFC.fit(X_train, y_train)
scores_EFC = cross_val_score(clf_EFC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_EFC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_EFC.mean(), scores_EFC.std() * 2))
print ("Best estimator found by grid search:")
print (clf_EFC.best_estimator_)
y_predicted_EFC = clf_EFC.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_EFC))
process = time.process_time()
print (process)
In [54]:
sns.heatmap(confusion_matrix(y_test, y_predicted_EFC), annot = True, fmt = '', cmap = "GnBu")
Out[54]:
In [55]:
print ("Fitting the Ada Boost Classifier to the training set")
param_grid = {
'n_estimators' :[3,5,10],
'learning_rate': [0.01],
#'max_depth': [2, 3, 4, 5, 6,7],
#'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_ABC = GridSearchCV(AdaBoostClassifier(random_state=123), param_grid, n_jobs=-1)
clf_ABC = clf_ABC.fit(X_train, y_train)
scores_ABC = cross_val_score(clf_ABC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_ABC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_ABC.mean(), scores_ABC.std() * 2))
print ("Best estimator found by grid search:")
print (clf_ABC.best_estimator_)
y_predicted_ABC = clf_ABC.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_ABC))
process = time.process_time()
print (process)
In [56]:
sns.heatmap(confusion_matrix(y_test, y_predicted_ABC), annot = True, fmt = '', cmap = "GnBu")
Out[56]:
In [57]:
print ("Fitting the Logistic Regression Classification - Hingue loss classifier to the training set")
param_grid = {
'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5],
#'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_LOGREG = GridSearchCV(LogisticRegression(random_state=123), param_grid, n_jobs=-1)
clf_LOGREG= clf_LOGREG .fit(X_train, y_train)
scores_LOGREG = cross_val_score(clf_LOGREG, X_train, y_train, cv=4, n_jobs=-1)
print(scores_LOGREG)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_LOGREG.mean(), scores_LOGREG.std() * 2))
print ("Best estimator found by grid search:")
print (clf_LOGREG.best_estimator_)
y_predicted_LOGREG = clf_LOGREG.predict(X_test)
print (metrics.classification_report(y_test, y_predicted_LOGREG))
process = time.process_time()
print (process)
In [58]:
sns.heatmap(confusion_matrix(y_test, y_predicted_LOGREG), annot = True, fmt = '', cmap = "GnBu")
Out[58]:
In [59]:
# Best Models
clf_RBF_b = SVC(C=1, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
y_predicted_RBF_b = clf_RBF_b.fit(X_train,y_train).predict(X_test)
clf_LSCV_b = LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
penalty='l2', random_state=None, tol=0.0001, verbose=0)
y_predicted_LSCV_b = clf_LSCV_b.fit(X_train,y_train).predict(X_test)
clf_SIGMOID_b = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=1,
decision_function_shape=None, degree=3, gamma=0.001, kernel='sigmoid',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
y_predicted_SIGMOID_b = clf_SIGMOID_b.fit(X_train,y_train).predict(X_test)
clf_DTC_b = DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
max_depth=7, max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=123, splitter='best')
y_predicted_DTC_b = clf_DTC_b.fit(X_train,y_train).predict(X_test)
clf_RFC_b = RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='gini', max_depth=7, max_features='auto',
max_leaf_nodes=None, min_impurity_split=1e-07,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=123, verbose=0, warm_start=False)
y_predicted_RFC_b = clf_RFC_b.fit(X_train,y_train).predict(X_test)
clf_EFC_b = ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
criterion='gini', max_depth=5, max_features='auto',
max_leaf_nodes=None, min_impurity_split=1e-07,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=123, verbose=0, warm_start=False)
y_predicted_EFC_b = clf_EFC_b.fit(X_train,y_train).predict(X_test)
clf_ABC_b = AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=0.01, n_estimators=3, random_state=123)
y_predicted_ABC_b = clf_ABC_b.fit(X_train,y_train).predict(X_test)
clf_LR_b= LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
y_predicted_LR_b = clf_LR_b.fit(X_train,y_train).predict(X_test)
In [60]:
fig, axes = plt.subplots(2,4)
sns.heatmap(confusion_matrix(y_test, y_predicted_RBF_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 0])
sns.heatmap(confusion_matrix(y_test, y_predicted_LSCV_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 1])
sns.heatmap(confusion_matrix(y_test, y_predicted_SIGMOID_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 2])
sns.heatmap(confusion_matrix(y_test, y_predicted_DTC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 3])
sns.heatmap(confusion_matrix(y_test, y_predicted_RFC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 0])
sns.heatmap(confusion_matrix(y_test, y_predicted_EFC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 1])
sns.heatmap(confusion_matrix(y_test, y_predicted_ABC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 2])
sns.heatmap(confusion_matrix(y_test, y_predicted_LR_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 3])
Out[60]:
In [61]:
eclf1 = VotingClassifier(estimators=[
('rbf',clf_RBF_b), ('LSCV',clf_LSCV_b),
('sigmoid',clf_SIGMOID_b), ('DTC',clf_DTC_b),
('RFC',clf_RFC_b),('EFC',clf_EFC_b),
('ABC',clf_ABC_b), ('svc',clf_LR_b)],
voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
y_predict_eclf1 = eclf1.predict(X_test)
print (eclf1.get_params(deep=True))
print (eclf1.score(X_train, y_train, sample_weight=None))
In [62]:
eclf2 = VotingClassifier(estimators=[
('rbf',clf_RBF), ('LSCV',clf_LSCV),
('sigmoid',clf_SIGMOID), ('DTC',clf_DTC),
('RFC',clf_RFC),('EFC',clf_EFC),
('ABC',clf_ABC), ('svc',clf_LOGREG)],
voting='hard')
eclf2 = eclf2.fit(X_train, y_train)
y_predict_eclf2 = eclf2.predict(X_test)
print (eclf2.get_params(deep=True))
print (eclf2.score(X_train, y_train, sample_weight=None))
#Basically does the same that chose the best models, this function uses the best models too
In [63]:
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
In [64]:
cnf_matrix_RBF = confusion_matrix(y_test, y_predicted_RBF_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RBF, classes= '1',
title='RBF Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RBF, classes='1', normalize=True,
title='RBF Normalized confusion matrix')
plt.show()
In [65]:
cnf_matrix_LSCV = confusion_matrix(y_test, y_predicted_LSCV_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LSCV, classes='1',
title='LSCV Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LSCV, classes='1', normalize=True,
title='LSCV Normalized confusion matrix')
plt.show()
In [66]:
cnf_matrix_SIGMOID = confusion_matrix(y_test, y_predicted_SIGMOID_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_SIGMOID, classes='1',
title='SIGMOID Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_SIGMOID, classes='1', normalize=True,
title='SIGMOID Normalized confusion matrix')
plt.show()
In [67]:
cnf_matrix_DTC = confusion_matrix(y_test, y_predicted_DTC_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_DTC, classes='1',
title='DTC Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_DTC, classes='1', normalize=True,
title='DTC Normalized confusion matrix')
plt.show()
In [68]:
cnf_matrix_RFC = confusion_matrix(y_test, y_predicted_RFC_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RFC, classes='1',
title='RFC Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RFC, classes='1', normalize=True,
title='RFC Normalized confusion matrix')
plt.show()
In [69]:
cnf_matrix_EFC = confusion_matrix(y_test, y_predicted_EFC_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_EFC, classes='1',
title='EFC Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_EFC, classes='1', normalize=True,
title='EFC Normalized confusion matrix')
plt.show()
In [70]:
cnf_matrix_ABC = confusion_matrix(y_test, y_predicted_ABC_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_ABC, classes='1',
title='ABC Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_ABC, classes='1', normalize=True,
title='ABC Normalized confusion matrix')
plt.show()
In [71]:
cnf_matrix_LR = confusion_matrix(y_test, y_predicted_LR_b)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LR, classes='1',
title='LR Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LR, classes='1', normalize=True,
title='LR Normalized confusion matrix')
plt.show()
The following ensemble model only aggregate five models:
In [72]:
eclf3 = VotingClassifier(estimators=[
('rbf',clf_RBF), ('sigmoid',clf_SIGMOID), ('DTC',clf_DTC),
('RFC',clf_RFC),('EFC',clf_EFC)],
voting='hard')
eclf3 = eclf3.fit(X_train, y_train)
y_predict_eclf3 = eclf3.predict(X_test)
print (eclf3.get_params(deep=True))
print (eclf3.score(X_train, y_train, sample_weight=None))
In [73]:
print(y_predict_eclf3)
In [74]:
scores_eclf3 = cross_val_score(eclf3 , X_train, y_train, cv=4, n_jobs=-1)
print(scores_eclf3 )
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_eclf3.mean(), scores_eclf3.std() * 2))
print (metrics.classification_report(y_test, y_predict_eclf3))
In [75]:
cnf_matrix_eclf3 = confusion_matrix(y_test, y_predict_eclf3)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf3, classes='1',
title='ECLF Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf3, classes='1', normalize=True,
title='ECLF Normalized confusion matrix')
plt.show()
This ensemble voting model aggregates decision tree and the extremely tree models:
In [76]:
eclf4 = VotingClassifier(estimators=[
('DTC',clf_DTC), ('EFC',clf_EFC)],
voting='hard')
eclf4 = eclf4.fit(X_train, y_train)
y_predict_eclf4 = eclf4.predict(X_test)
print (eclf4.get_params(deep=True))
print (eclf4.score(X_train, y_train, sample_weight=None))
In [77]:
scores_eclf4 = cross_val_score(eclf4 , X_train, y_train, cv=4, n_jobs=-1)
print(scores_eclf4 )
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_eclf4.mean(), scores_eclf4.std() * 2))
print (metrics.classification_report(y_test, y_predict_eclf4))
In [78]:
cnf_matrix_eclf4 = confusion_matrix(y_test, y_predict_eclf4)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf4, classes='1',
title='ECLF Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf4, classes='1', normalize=True,
title='ECLF Normalized confusion matrix')
plt.show()
The best model found here is the ensemble model with the decision tree and the extremely tree, even though the ensemble model with five aggregate methods shows a slightly better score, applying the principle of Occam's razor make the simplest better. At this time, the resulting model can accurately predict the deaths given a series of medical examinations, as proposed in the hypothesis. While accuracy is not the best (76%), I think it may be a good start for future investigations of interdisciplinary teams in ICU forecasting diseases.
From the forest model is possible to find how features weights in the results, such weights are called importance. As you can see only 129 features are important to the model, the rest has no weights. As you can expect, the antibiotic sensitivity are the most important feature (together weights 57% of importance) and AMIKACIN antibiotic the most important feature of the sample. Every feature from age group weight 0.97% in average, followed by category feature which everyone weights less than 1%.
In [88]:
# Code source: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
from sklearn.datasets import make_classification
forest = clf_EFC_b
forest.fit(X_train, y_train)
feature_names = X_train.columns
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(X_train.shape[1]):
if importances[indices[f]]>0:
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
#plt.xticks(range(heart_train.shape[1]), )
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names)
plt.xlim([-1, X_train.shape[1]])
plt.show()
In [108]:
from sklearn.datasets import make_classification
forest = clf_EFC_b
forest.fit(X_train, y_train)
feature_names = X_train.columns
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
impor = []
for f in range(X_train.shape[1]):
if importances[indices[f]]>0:
impor.append({'Feature': feature_names[f] , 'Importance': importances[indices[f]]})
feature_importance = pd.DataFrame(impor).sort_values('Importance',ascending = False)
print(feature_importance.to_string())
In [103]:
feature_importance.info()
In [105]:
feature_importance['Importance'].sum()
Out[105]:
In [122]:
#surveys_df[surveys_df.year == 2002]
more_important= feature_importance[feature_importance.Importance >= 0.009]
more_important
Out[122]:
In [127]:
more_important.plot(kind='bar')
Out[127]:
In [ ]: