Machine Learning Engineer Nanodegree

Capstone Project

Santiago Giraldo

July 29, 2017

Perhaps one of the trendiest topics in the world right now is machine learning and big data, which have become recurring topics on Main Street. The application of these in different fields including branches of knowledge that range from astronomy to manufacturing, but perhaps the field of medicine is one the most interesting of them all. Current advances in this field include disease diagnosis, image analysis, new drug developments and pandemic forecasts among others. These progress poses new challenges and opportunities to improve the quality of treatments and the life expectancy of human beings.

Being able to participate in something that contributes to human well-being motivated me to research and apply machine learning in an area of medicine to the capstone project.

From personal experience, a recurring problem that affects people at certain ages is pneumonia, which in some cases is usually fatal especially when it refers to children or older adults. Knowing this and the availability of records in Intensive Care Unit MIMIC-III database, I decided to select pneumonia deaths as a theme to develop the capstone project, that in addition, could be useful for the prognosis of deaths in Intensive Care Units, and with further investigations be the outset for development of tools that doctors and nurses could use to improve their work.

The hypothesis is that from microbiological variables coming from tests it’s possible predict whether a patient can or not die by this disease. This leads us to a problem that can be represented binarily, and that can be related to physicochemical variables obtained from microbiological tests. These relationships allow the modeling of the pneumonia death problem by means of a supervised learning model such as the Support Vector Machines, Decision Trees, Logistic Regression and Ada boost ensemble.



In [1]:

    
import numpy as np
import pandas as pd
import datetime
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import psycopg2
import time
import itertools
from pandas.io.sql import read_sql
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier
from sklearn import metrics

The MIMIC III database consists of a collection of csv files, which can be imported into PostgreSQL. Once imported to PostgreSQL, is possible to use libraries from python to analyze the different data contained in it, make the necessary transformations to implement the desired forecast model. The input variables will be defined from different tables and seeking relate the independent binary variable (life or death) of subjects with age, sex, results of microbiological events (test results) and severity of these results.

Before you can download the data, you must complete the CITI "Data or Specimens Only Research" course . Once you accomplish the course you can download the data from https://physionet.org/works/MIMICIIIClinicalDatabase/. Then, the first step was to understand the structure of the database. This consists of a collection of 26 csv files, which can be imported into PostgreSQL. These files contain medical, economic, demographic and death of patients admitted information for several years at the ICU of Beth Israel Deaconess Medical Center, as the data is sensitive, some records were changed like date of admittance and date of birth, in order to avoid the identification of the patients from these records, and this information will be misused in the future.



In [2]:

    
conn=psycopg2.connect(
    dbname='mimic',
    user='postgres',
    host='localhost',
    port=5432,
    password= 123
)
cur = conn.cursor()
process = time.process_time()
print (process)

The first step was creating four tables to facilitate consult the required data for the analysis, these tables are:

last_event: This table born from a join of patients and admissions tables. In this, was selected the fields subject_id, dob, and gender. The age is computed for all patients, the last admission column is created and all age are classified by age groups as categorical variable.

age: Is a join between last_event and admission tables. In this, I selected the subject_id, last_admit_age, gender, last_admit_time, but the records are limited to last patient admission (there are records for several admissions for some patients, so is important filter the last one to related the records with deaths when these occur) computed in last_event table.

valuenum_avg: In a first instance, I have grouped the 14 tables that have the data records of graphical events. As a group, it is the largest table in the database and it contains 330,712,483 records. Given the size of the data, hardware constraints, I considered a strong assumption, and is that the records in this table where the numerical value (valuenum) of these graphic events are measured, can be averaged (huge assumption) can serve as a numerical dependent variable within the models to be studied. It is a huge assumption because I have no evidence from other studies, at least as far as I know, the results average can be done. But on the other hand, you can think by experience (as patient because I’m not physician), the results from exams are a good estimation, and the issue, at least for me, is if this data could be averaged as I did and if it could be a good proxy regressor. For this table, I take this data: subject_id, hadm_id, itemid, and compute valuenum_avg.

pneumonia: It is the most important table for this study because I group the relevant data from others tables like microbiology events, charted events, and demographic data. The specific fields grouped here are: hospital_expire_flag, subject_id, hadm_id, gender, last_admittime, last_admit_age, age_group, itemid, label, category, valuenum_avg, icd9_code, short_title, spec_type_desc, org_name, ab_name, interpretation. And this data where filtered by pneumonia word in long_title diagnosis field, values not null in interpretation in microbiology events, values not null in category laboratory items and admittime is equal to last_admittime. The objective here is assuring me that data is complete (not null records), is related with pneumonia diagnosis and the records selected where from the last admission.

The final result of these process is a sql query which filter and transform the data in a matrix with this columns fields: hospital_expire_flag, subject_id, gender, last_admit_age, age_group, category, label, valuenum_avg, org_name, ab_name, interpretation, and 2,430,640 records to 829 patients with some diagnosis related with pneumonia panda data frame



In [3]:

    
sql_sl = " SELECT hospital_expire_flag, subject_id, gender, last_admit_age, age_group, category, \
            label, valuenum_avg, org_name,ab_name, interpretation \
            FROM mimiciii.pneumonia;"

patients_pn= read_sql(sql_sl, conn, coerce_float=True, params=None)
print (patients_pn.info())
process = time.process_time()
print (process)









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2430640 entries, 0 to 2430639
Data columns (total 11 columns):
hospital_expire_flag    int64
subject_id              int64
gender                  object
last_admit_age          float64
age_group               object
category                object
label                   object
valuenum_avg            float64
org_name                object
ab_name                 object
interpretation          object
dtypes: float64(2), int64(2), object(7)
memory usage: 204.0+ MB
None
15.71875

For each subject, I categorized the population in five types of groups according to the age recorded at the time of admission to the ICU, which are neonates [0,1], middle (1, 14), adults (14, 65), Older adults [65, 85] and older elderly people (85, 91.4].



In [4]:

    
patients_pn.head()









    Out[4]:






  
    
      
      hospital_expire_flag
      subject_id
      gender
      last_admit_age
      age_group
      category
      label
      valuenum_avg
      org_name
      ab_name
      interpretation
    
  
  
    
      0
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      TETRACYCLINE
      S
    
    
      1
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      RIFAMPIN
      S
    
    
      2
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      ERYTHROMYCIN
      R
    
    
      3
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      GENTAMICIN
      S
    
    
      4
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      VANCOMYCIN
      S

In the Website they explain that the average age of these patients is 91.4 years, reason why I decided that if I want have some consistent data from this segment of the population I should replace it at least for its average value



In [5]:

    
row_index = patients_pn.last_admit_age >= 300
patients_pn.loc[row_index , 'last_admit_age' ] = 91.4  #https://mimic.physionet.org/mimictables/patients/

Hospital_expire_flag has the binary dependent variable to the model, 0 when the patient goes out from ICU alive, and 1 when the patient has deceased while stay in ICU.

Subject_id is the key value which relates the respective record with an acute patient in ICU. gender give the patient sex of the subject. Last_admit_age (is a computed field) has the age when the patient is admitted in ICU.

Age_group (is a computed field) serves to categorize the sample by age.

Valuenum_avg is the average number for valuenum of this respective label measure. Org_name contains the names of the microorganisms (bacteria) related to pneumonia, where the main ones are staph aureus coag +, klebsiella pneumoniae, escherichia coli, pseudomonas aeruginosa, staphylococcus, coagulase negative, klebsiella oxytoca, enterococcus sp, which represent 80% of the sample.

Category is employed to categorize the charted events, the main categories of this data column are Labs, Respiratory, Alarms, Routine Vital Signs, Chemistry which gathering the 82% of the records which are present in this query.



In [6]:

    
patients_pn['category'].unique()









    Out[6]:





array(['ABG', 'Chemistry', 'Hematology', 'Coags', 'Enzymes', "ABG's",
       'Drug Level', 'Routine Vital Signs', 'Alarms', 'Respiratory',
       'Labs', 'General', 'Blood Gases', "VBG's", 'Skin - Impairment',
       "VBG'S", 'Mixed Venous Gases', 'Heme/Coag', 'Other ABGs', 'CSF',
       'Drug level', "ABG'S", 'Mixed VBGs', 'Urine', 'OT Notes',
       'Hemodynamics', 'Dialysis', 'Pain/Sedation', 'GI/GU', 'NICOM',
       'Cardiovascular (Pacer Data)', 'IABP', 'PiCCO', 'Adm History/FHPA',
       'Access Lines - Invasive', 'Treatments', 'Scores - APACHE IV (2)',
       'Scores - APACHE II', 'Venous ABG', 'Quick Admit', 'OB-GYN'], dtype=object)



In [7]:

    
patients_pn['category'].value_counts()









    Out[7]:





Labs                           951194
Respiratory                    538414
Alarms                         255555
Routine Vital Signs            157533
Chemistry                      106007
General                         87434
Skin - Impairment               56457
Hematology                      52049
Dialysis                        35259
Hemodynamics                    27275
ABG                             25475
Coags                           25359
Enzymes                         24121
OT Notes                        17580
Pain/Sedation                   10095
Drug Level                       8189
Cardiovascular (Pacer Data)      7590
Scores - APACHE IV (2)           7038
Scores - APACHE II               4950
ABG's                            3992
PiCCO                            3975
VBG's                            3528
NICOM                            3318
GI/GU                            2793
Blood Gases                      2782
IABP                             2150
Mixed Venous Gases               1819
Treatments                       1698
Heme/Coag                        1628
VBG'S                            1362
Other ABGs                       1050
Adm History/FHPA                  632
ABG'S                             473
Mixed VBGs                        420
Access Lines - Invasive           342
Venous ABG                        316
CSF                               315
Drug level                        149
Urine                             139
OB-GYN                            124
Quick Admit                        61
Name: category, dtype: int64



In [8]:

    
patients_category = patients_pn['category'].value_counts().reset_index()
patients_category.columns=['category','Count']
patients_category['Count'].apply('{:,.2f}'.format)
patients_category['cum_perc'] = 100*patients_category.Count/patients_category.Count.sum()
patients_category['cum_perc'] = patients_category['cum_perc'].map('{:,.4f}%'.format)

print (patients_category)









    



                       category   Count  cum_perc
0                          Labs  951194  39.1335%
1                   Respiratory  538414  22.1511%
2                        Alarms  255555  10.5139%
3           Routine Vital Signs  157533   6.4811%
4                     Chemistry  106007   4.3613%
5                       General   87434   3.5972%
6             Skin - Impairment   56457   2.3227%
7                    Hematology   52049   2.1414%
8                      Dialysis   35259   1.4506%
9                  Hemodynamics   27275   1.1221%
10                          ABG   25475   1.0481%
11                        Coags   25359   1.0433%
12                      Enzymes   24121   0.9924%
13                     OT Notes   17580   0.7233%
14                Pain/Sedation   10095   0.4153%
15                   Drug Level    8189   0.3369%
16  Cardiovascular (Pacer Data)    7590   0.3123%
17       Scores - APACHE IV (2)    7038   0.2896%
18           Scores - APACHE II    4950   0.2037%
19                        ABG's    3992   0.1642%
20                        PiCCO    3975   0.1635%
21                        VBG's    3528   0.1451%
22                        NICOM    3318   0.1365%
23                        GI/GU    2793   0.1149%
24                  Blood Gases    2782   0.1145%
25                         IABP    2150   0.0885%
26           Mixed Venous Gases    1819   0.0748%
27                   Treatments    1698   0.0699%
28                    Heme/Coag    1628   0.0670%
29                        VBG'S    1362   0.0560%
30                   Other ABGs    1050   0.0432%
31             Adm History/FHPA     632   0.0260%
32                        ABG'S     473   0.0195%
33                   Mixed VBGs     420   0.0173%
34      Access Lines - Invasive     342   0.0141%
35                   Venous ABG     316   0.0130%
36                          CSF     315   0.0130%
37                   Drug level     149   0.0061%
38                        Urine     139   0.0057%
39                       OB-GYN     124   0.0051%
40                  Quick Admit      61   0.0025%



In [9]:

    
patients_pn['category'].value_counts().plot(kind='bar')









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x23014e68b70>

Label is the detail of the category, and is represented in 578 labels, where the most important are: Hemoglobin Arterial Base Excess, Phosphorous, WBC, Creatinine, Magnesium, PTT, INR, ALT, AST, Lactic Acid. And the largest amount (Hemoglobin Arterial Base Excess) represents 0.94% of the sample and the lowest (Lactic Acid) 0.82% of the sample.



In [10]:

    
patients_pn['label'].unique()









    Out[10]:





array(['Arterial Base Excess', 'Arterial CO2(Calc)', 'Arterial PaCO2',
       'Arterial PaO2', 'Arterial pH', 'BUN (6-20)', 'Calcium (8.4-10.2)',
       'Carbon Dioxide', 'Chloride (100-112)', 'Creatinine (0-1.3)',
       'Differential-Bands', 'Differential-Basos', 'Differential-Eos',
       'Differential-Lymphs', 'Differential-Monos', 'Differential-Polys',
       'Fingerstick Glucose', 'Glucose (70-105)', 'Hematocrit',
       'Hemoglobin', 'INR (2-4 ref. range)', 'Ionized Calcium', 'LDH',
       'Lactic Acid(0.5-2.0)', 'Magnesium (1.6-2.6)', 'PT(11-13.5)',
       'PTT(22-35)', 'Phosphorous(2.7-4.5)', 'Platelets',
       'Potassium (3.5-5.3)', 'RBC', 'SaO2', 'Sodium (135-148)',
       'Vancomycin/Random', 'WBC (4-11,000)', 'Art.pH', 'WBC   (4-11,000)',
       'Calcium', 'Chloride', 'Creatinine', 'Glucose', 'INR',
       'Lactic Acid', 'Magnesium', 'PTT', 'Phosphorous', 'Potassium',
       'Sodium', 'WBC', 'Heart Rate', 'Heart rate Alarm - High',
       'Heart Rate Alarm - Low', 'Non Invasive Blood Pressure systolic',
       'Non Invasive Blood Pressure diastolic',
       'Non Invasive Blood Pressure mean', 'Respiratory Rate',
       'O2 saturation pulseoxymetry', 'Hematocrit (serum)', 'AST',
       'Chloride (serum)', 'Cholesterol', 'Glucose (serum)', 'HDL', 'ALT',
       'Sodium (serum)', 'Non-Invasive Blood Pressure Alarm - High',
       'Non-Invasive Blood Pressure Alarm - Low', 'Temperature Fahrenheit',
       'O2 Saturation Pulseoxymetry Alarm - High',
       'O2 Saturation Pulseoxymetry Alarm - Low', 'O2 Flow',
       'Inspired O2 Fraction', 'Resp Alarm - High', 'Resp Alarm - Low',
       'Daily Weight', 'Alkaline Phosphate', 'BUN', 'Calcium non-ionized',
       'CK (CPK)', 'LDL calculated', 'Total Bilirubin', 'Triglyceride',
       'SpO2 Desat Limit', 'Admission Weight (Kg)',
       'Admission Weight (lbs.)', 'Height', 'Height (cm)', 'Anion gap',
       'Troponin-T', 'Potassium (serum)', 'HCO3 (serum)', 'CK-MB',
       'Platelet Count', 'FK506', 'Prothrombin time', 'Fibrinogen', 'CPK',
       'CPK/MB', 'Digoxin', 'Albumin (>3.2)', 'Alk. Phosphate', 'Amylase',
       'Mixed Venous O2', 'Mixed Venous O2% Sat', 'Total Bili (0-1.5)',
       'Venous pH', 'Minute Volume Alarm - Low',
       'Minute Volume Alarm - High', 'PEEP set', 'PH (dipstick)',
       'Inspired Gas Temp.', 'Paw High', 'Vti High', 'Fspn High',
       'Apnea Interval', 'MDI #1 Puff', 'MDI #2 Puff', 'Cuff Pressure',
       'Spont Vt', 'Spont RR', 'Tidal Volume (set)',
       'Tidal Volume (observed)', 'Tidal Volume (spontaneous)',
       'Minute Volume', 'Respiratory Rate (Set)',
       'Respiratory Rate (spontaneous)', 'Respiratory Rate (Total)',
       'Peak Insp. Pressure', 'Plateau Pressure', 'Mean Airway Pressure',
       'Total PEEP Level', 'PSV Level', 'Inspiratory Time',
       'Impaired Skin Length #2', 'Impaired Skin Width #2',
       'Differential-Neuts', 'Direct Bilirubin', 'Glucose finger stick',
       'Lipase', 'Expiratory Ratio', 'Inspiratory Ratio',
       'Vancomycin (Trough)', 'Albumin', 'Specific Gravity (urine)',
       'Ventilator Tank #1', 'Ventilator Tank #2', 'Fibrinogen (150-400)',
       'Troponin', 'Direct Bili (0-0.3)', 'Total Protein(6.5-8)',
       'Triglyceride (0-200)', 'Direct Bili', 'Total Bili',
       'Total Protein', 'Uric Acid (2.7-7.0)', 'Gentamycin/Random',
       'Tobramycin/Random', 'Venous CO2(Calc)', 'Venous PvCO2',
       'Venous PvO2', 'Uric Acid', 'Arterial O2 pressure',
       'Arterial O2 Saturation', 'Arterial CO2 Pressure', 'PH (Venous)',
       'Ammonia', 'TCO2 (calc) Venous', 'PH (Arterial)',
       'Differential-Atyps', 'Total Granulocyte Count (TGC)',
       'TCO2 (calc) Arterial', 'Venous CO2 Pressure', 'Venous O2 Pressure',
       'Gentamicin (Random)', 'Gentamicin (Peak)', 'Gentamicin (Trough)',
       'Tobramycin (Trough)', 'Vancomycin (Random)',
       'Potassium (whole blood)', 'PO2 (Mixed Venous)', 'Sed Rate',
       'Vancomycin/Peak', 'Vancomycin/Trough', 'Serum Osmolality', 'SvO2',
       'D-Dimer (0-500)', 'D-Dimer', 'Dilantin', 'ACT (102-142)',
       'Cholesterol (<200)', 'ACT', 'Ethanol', 'Alkaline Phosphatase',
       'Anion Gap   (8-20)', 'BANDS', 'BASOs', 'BE', 'BUN    (6-20)',
       'Bands', 'Base Excess (other)', 'Basos', 'Calcium   (8.8-10.8)',
       'Chloride  (100-112)', 'Creatinine   (0-0.7)', 'Eosinophils',
       'Gentamicin Post 5-10', 'Gentamicin Pre (0-2)', 'HGB  (10.8-15.8)',
       'Hematocrit (35-51)', 'Indirect Bili(0-1.0)', 'LYMPHS', 'Lymphs',
       'MONOs', 'Mix Venous CO2(calc)', 'Mix Venous PCO2',
       'Mix Venous PO2', 'Mix Venous pH', 'Monos', 'NEUTS', 'PCO2', 'PO2',
       'Platelet  (150-440)', 'Polys', 'Potassium  (3.5-5.3)',
       'Red Blood C(3.6-6.2)', 'Sodium  (135-148)', 'TCO2        (21-30)',
       'TCO2 (other)', 'Total CO2', 'Triglyceride (0-250)',
       'Urine Glucose', 'Urine Ketones', 'Urine Leukocytes',
       'Vancomycin Post', 'Vancomycin Pre', 'WhiteBloodC 4.0-11.0', 'pCO2',
       'pCO2 (other)', 'pO2', 'pO2 (other)', 'ph (other)',
       'Base Excess (cap)', 'RBC(3.6-6.2)', 'TCO2 (cap)', 'WBC 4.0-11.0',
       'pCO2 (cap)', 'pH (cap)', 'pO2 (cap)', 'pH (Art)',
       'Gentamycin/Peak', 'Gentamycin/Trough',
       'Arterial Blood Pressure systolic',
       'Arterial Blood Pressure diastolic', 'Arterial Blood Pressure mean',
       'Arterial Blood Pressure Alarm - Low',
       'Arterial Blood Pressure Alarm - High', 'ATC %', 'P High (APRV)',
       'P Low (APRV)', 'T High (APRV)', 'T Low (APRV)',
       'Rest HR - Aerobic Capacity', 'Rest O2 Sat - Aerobic Capacity',
       'Central Venous Pressure Alarm - High',
       'Central Venous Pressure  Alarm - Low', 'Central Venous Pressure',
       'Central Venous O2% Sat', 'Blood Flow (ml/min)',
       'Heparin Dose (per hour)', 'Access Pressure', 'Filter Pressure',
       'Effluent Pressure', 'Return Pressure', 'Replacement Rate',
       'Dialysate Rate', 'Hourly Patient Fluid Removal',
       'ART Lumen Volume', 'VEN Lumen Volume', 'Cuff Volume (mL)',
       'Baseline Current/mA', 'Current Used/mA', 'Impaired Skin Depth #8',
       'Impaired Skin Length #8', 'Impaired Skin Width #8', 'Current Goal',
       'CK-MB fraction (%)', 'Estimated Energy Needs/Kg',
       'Estimated Protein Needs/Kg', 'Ultrafiltrate Output',
       'Hemodialysis Output', 'Glucose (whole blood)',
       'Hematocrit (whole blood - calc)', 'Feeding Weight',
       'Brain Natiuretic Peptide (BNP)', 'Cortisol', 'Temperature Celsius',
       'Rest RR - Aerobic Capacity', 'Activity HR - Aerobic Capacity',
       'Activity RR - Aerobic Capacity',
       'Activity O2 Sat - Aerobic Capacity',
       'Recovery HR - Aerobic Capacity', 'Recovery RR - Aerobic Capacity',
       'Recovery O2 Sat - Aerobic Capacity', 'Bladder Pressure',
       'Impaired Skin Length #1', 'Impaired Skin Depth #1',
       'Impaired Skin Width #1', 'Impaired Skin Depth #2',
       'Impaired Skin Depth #3', 'Impaired Skin Depth #4',
       'Impaired Skin Length #3', 'Impaired Skin Length #4',
       'Impaired Skin Width #3', 'Impaired Skin Width #4',
       'Sodium (whole blood)', 'Chloride (whole blood)',
       'Pinsp (Draeger only)', 'O2 Flow (additional cannula)',
       'C Reactive Protein (CRP)', 'ART Blood Pressure Alarm - High',
       'ART Blood Pressure Alarm - Low', 'GI pH',
       'Rest HR -  Aerobic Activity Response',
       'Rest RR - Aerobic Activity Response',
       'Rest O2 Sat - Aerobic Activity Response',
       'Activity HR - Aerobic Activity Response',
       'Activity RR - Aerobic Activity Response',
       'Activity O2 sat - Aerobic Activity Response',
       'Recovery HR - Aerobic Activity Response',
       'Recovery RR - Aerobic Activity Response',
       'Recovery O2 sat - Aerobic Activity Response', 'Citrate (ACD-A)',
       'PBP (Prefilter) Replacement Rate', 'Post Filter Replacement Rate',
       'Cardiac Index (CI NICOM)', 'Cardiac Output (CO NICOM)',
       'CO / CI Change', 'Stroke Volume (SV NICOM)',
       'Stroke Volume Index (SVI NICOM)',
       'Stroke Volume Variation (SVV NICOM)', 'SVI Change',
       'Thoracic Fluid Content (TFC) (NICOM)',
       'Total Peripheral Resistance (TPR) (NICOM)',
       'Total Peripheral Resistance Index (TPRI) (NICOM)', 'PCA dose',
       'PCA lockout (min)', 'PCA 1 hour limit', 'PCA basal rate (mL/hour)',
       'PCA inject', 'PCA bolus', 'PCA attempt', 'PCA total dose',
       'Activated Clotting Time', 'ART BP Systolic', 'ART BP Diastolic',
       'ART BP mean', 'BiPap EPAP', 'BiPap IPAP', 'BiPap O2 Flow',
       'Intra Cranial Pressure', 'Intra Cranial Pressure Alarm - High',
       'Intra Cranial Pressure Alarm - Low', 'Cerebral Perfusion Pressure',
       'Cerebral Perfusion Pressure Alarm - High',
       'Cerebral Perfusion Pressure Alarm - Low',
       'Manual Blood Pressure Systolic Left',
       'Manual Blood Pressure Diastolic Left', 'Impaired Skin Length #5',
       'Impaired Skin Width #5', 'Pulmonary Artery Pressure systolic',
       'Pulmonary Artery Pressure diastolic',
       'Pulmonary Artery Pressure mean',
       'Pulmonary Artery Pressure Alarm - High',
       'Pulmonary Artery Pressure Alarm - Low', 'Temporary AV interval',
       'PA Line cm Mark', 'Temporary Pacemaker Wires Venticular',
       'Assisted Systole', 'Augmented Diastole', 'BAEDP', 'IABP Mean',
       'Unassisted Systole', 'PAEDP', 'Temporary Pacemaker Rate',
       'Temporary Atrial Sens Setting mV',
       'Temporary Atrial Stim Setting mA',
       'Temporary Ventricular Sens Setting mV',
       'Temporary Ventricular Stim Setting mA',
       'Temporary Pacemaker Wires Atrial', 'Cardiac Output (CCO)',
       'Permanent Pacemaker Rate', 'Blood Temperature CCO (C)',
       'Flow Rate (L/min)', 'Impaired Skin Length #6',
       'Impaired Skin Width #6', 'Tobramycin (Peak)',
       'Tobramycin (Random)', 'Nitric Oxide',
       'Transpulmonary Pressure (Exp. Hold)',
       'Transpulmonary Pressure (Insp. Hold)', 'Small Volume Neb Dose #2',
       'Negative Insp. Force', 'Impaired Skin Length #7',
       'Impaired Skin Width #7', 'CO (Arterial)', 'SVV (Arterial)',
       'SV (Arterial)', 'ScvO2 (Presep)', 'Impaired Skin Length #9',
       'Impaired Skin Width #9', 'Impaired Skin Depth #6',
       'Medication Infusion Rate - Adjunctive Pain Management',
       'CI (PiCCO)', 'SVI (PiCCO)', 'SVV (PiCCO)', 'SVRI (PiCCO)',
       'MDI #3 Puff', 'QTc', 'Recruitment Duration',
       'Manual Blood Pressure Diastolic Right',
       'Manual Blood Pressure Systolic Right', 'Vital Cap',
       'Phenytoin (Dilantin)', 'Cardiac Output (thermodilution)',
       'Temporary Venticular Sens Threshold mV',
       'Temporary Venticular Stim Threshold mA', 'Recruitment Press',
       'Temporary Atrial Sens Threshold mV', 'CFI (PiCCO)', 'CO (PiCCO)',
       'ELWI (PiCCO)', 'GEDI (PiCCO)', 'Temporary Pacemaker Wires Ground',
       'Temporary Atrial Stim Threshold mA', 'Cash amount', 'IABP Volume',
       'Arctic Sun/Alsius Set Temp', 'Arctic Sun Water Temp',
       'Arctic Sun/Alsius Temp #1 C', 'Arctic Sun/Alsius Temp #2 C',
       'LDL measured', 'Phenytoin (Free)', 'Intra Cranial Pressure #2',
       'PCWP', 'Nitric Oxide Tank Pressure', 'Vancomycin (Peak)',
       'BIS Index Range', 'Impaired Skin Depth #5',
       'Impaired Skin Depth #7', 'CO2 production', 'Vd/Vt Ratio',
       'Pulsus Paradoxus', 'Skeletal Traction #1 - Pounds',
       'Phenobarbital', 'Intra Cranial Pressure #2 Alarm - Low',
       'BiPap bpm (S/T -Back up)', 'Epidural Infusion Rate (mL/hr)',
       'Epidural Bolus (mL)', 'PCV Level (Avea)', 'Alsius Bath Temp',
       'Apache IV A-aDO2', 'APACHEIII', 'ApacheIV_LOS',
       'Epidural Total Dose (mL)', 'Thrombin', 'Impaired Skin Length #10',
       'Impaired Skin Width #10', 'ABI Brachial BP L', 'ABI Ankle BP R',
       'ABI Ankle BP L', 'ABI Brachial BP R', 'Volume Out (PD)',
       'Resistance', 'AaDO2ApacheIIValue', 'AgeApacheIIScore',
       'AgeApacheIIValue', 'APACHE II', 'APACHE II PDR - Adjusted',
       'APACHE II Predecited Death Rate', 'ChpApacheIIScore',
       'CreatinineApacheIIScore', 'CreatinineApacheIIValue',
       'DswfApacheScore', 'FiO2ApacheIIValue', 'GcsApacheIIScore',
       'HCO3ApacheIIValue', 'HCO3Score', 'HematocritApacheIIScore',
       'HematocritApacheIIValue', 'HrApacheIIScore', 'HrApacheIIValue',
       'MapApacheIIScore', 'MapApacheIIValue', 'OxygenApacheIIScore',
       'PhApacheIIScore', 'PHApacheIIValue', 'PotassiumApacheIIScore',
       'PotassiumApacheIIValue', 'RrApacheIIScore', 'RRApacheIIValue',
       'SodiumApacheIIScore', 'SodiumApacheIIValue', 'TempApacheIIScore',
       'TempApacheIIValue', 'WbcApacheIIScore', 'WBCApacheIIValue',
       'AgeScore_ApacheIV', 'Albumin_ApacheIV', 'AlbuminScore_ApacheIV',
       'Apache IV Age', 'Apache IV PaFiRatio',
       'ApacheIV_Mortality prediction', 'ApacheIV_Natural antilog', 'APS',
       'Bilirubin_ApacheIV', 'BiliScore_ApacheIV', 'BUN_ApacheIV',
       'BunScore_ApacheIV', 'ChronicScore_ApacheIV', 'Creatinine_ApacheIV',
       'CreatScore_ApacheIV', 'Ejection Fraction', 'FiO2_ApacheIV',
       'GcsScore_ApacheIV', 'Glucose_ApacheIV', 'GlucoseScore_ApacheIV',
       'Hematocrit_ApacheIV', 'HR_ApacheIV', 'HrScore_ApacheIV',
       'HtScore_ApacheIV', 'LOS pre-ICU admission', 'MAP_ApacheIV',
       'MapScore_ApacheIV', 'OxygenScore_ApacheIV', 'PCO2_ApacheIV',
       'PH_ApacheIV', 'PHPaCO2Score_ApacheIV', 'PO2_ApacheIV',
       'RR_ApacheIV', 'RRScore_ApacheIV', 'Sodium_ApacheIV',
       'SodiumScore_ApacheIV', 'TemperatureF_ApacheIV',
       'TempScore_ApacheIV', 'UrineScore_ApacheIV', 'WBC_ApacheIV',
       'WBCScore_ApacheIV', 'Urine output_ApacheIV', 'Procan',
       'Procan Napa', 'Thrombin (16-21)', 'ABG POTASSIUM',
       'Albumin  (3.9-4.8)', 'Blood Glucose', 'Ion Calcium', 'ProTime',
       'Ptt', 'SGOT', 'SGPT', 'Venous Base Excess', 'Venous CO2',
       'Venous O2', 'Venous TCO2', 'ABG Potassium', 'ABG CHLOIRDE',
       'ABG SODIUM', 'BloodGlucose', 'T. Protein (5-7.5)', 'SOFA Score',
       'Volume In (PD)', 'Dwell Time (Peritoneal Dialysis)', 'Doppler BP',
       'Intra Cranial Pressure #2 Alarm - High', 'Clot Size (cm)',
       'Impaired Skin Depth #9', 'Impaired Skin Depth #10',
       'Left Ventricular Assit Device Flow', 'ECMO'], dtype=object)



In [11]:

    
patients_pn['label'].value_counts()









    Out[11]:





Hemoglobin                                22879
Arterial Base Excess                      21749
Phosphorous                               21579
WBC                                       21495
Creatinine                                21495
Magnesium                                 21474
PTT                                       21208
INR                                       21129
ALT                                       20185
AST                                       20172
Lactic Acid                               20031
Ionized Calcium                           19387
Differential-Monos                        19065
Differential-Lymphs                       19065
Differential-Eos                          19065
Differential-Basos                        19065
Albumin                                   18678
Hematocrit (serum)                        18412
Platelet Count                            18412
BUN                                       18412
Glucose (serum)                           18412
Potassium (serum)                         18412
HCO3 (serum)                              18412
Chloride (serum)                          18412
Anion gap                                 18401
Sodium (serum)                            18401
Calcium non-ionized                       18373
Prothrombin time                          18068
LDH                                       17577
TCO2 (calc) Arterial                      17494
                                          ...  
Resistance                                   65
Impaired Skin Depth #9                       65
ABG SODIUM                                   61
ABG CHLOIRDE                                 61
BloodGlucose                                 61
T. Protein (5-7.5)                           61
Pulsus Paradoxus                             60
Impaired Skin Depth #8                       55
Skeletal Traction #1 - Pounds                55
Temporary AV interval                        53
Tobramycin/Random                            50
Intra Cranial Pressure #2 Alarm - Low        47
Procan Napa                                  44
Procan                                       44
Dwell Time (Peritoneal Dialysis)             42
Volume In (PD)                               42
Ethanol                                      40
Doppler BP                                   40
Impaired Skin Depth #10                      36
Urine Leukocytes                             26
pH (Art)                                     26
Urine Ketones                                26
ABG Potassium                                18
Intra Cranial Pressure #2                    18
Intra Cranial Pressure #2 Alarm - High       16
SOFA Score                                   10
ABI Brachial BP L                             8
ABI Ankle BP L                                8
ABI Brachial BP R                             8
ABI Ankle BP R                                8
Name: label, dtype: int64



In [12]:

    
patients_label = patients_pn['label'].value_counts().reset_index()
patients_label.columns=['label','Count']
patients_label['Count'].apply('{:,.2f}'.format)
patients_label['cum_perc'] = 100*patients_label.Count/patients_label.Count.sum()
patients_label['cum_perc'] = patients_label['cum_perc'].map('{:,.4f}%'.format)

print (patients_label)









    



                                      label  Count cum_perc
0                                Hemoglobin  22879  0.9413%
1                      Arterial Base Excess  21749  0.8948%
2                               Phosphorous  21579  0.8878%
3                                       WBC  21495  0.8843%
4                                Creatinine  21495  0.8843%
5                                 Magnesium  21474  0.8835%
6                                       PTT  21208  0.8725%
7                                       INR  21129  0.8693%
8                                       ALT  20185  0.8304%
9                                       AST  20172  0.8299%
10                              Lactic Acid  20031  0.8241%
11                          Ionized Calcium  19387  0.7976%
12                       Differential-Monos  19065  0.7844%
13                      Differential-Lymphs  19065  0.7844%
14                         Differential-Eos  19065  0.7844%
15                       Differential-Basos  19065  0.7844%
16                                  Albumin  18678  0.7684%
17                       Hematocrit (serum)  18412  0.7575%
18                           Platelet Count  18412  0.7575%
19                                      BUN  18412  0.7575%
20                          Glucose (serum)  18412  0.7575%
21                        Potassium (serum)  18412  0.7575%
22                             HCO3 (serum)  18412  0.7575%
23                         Chloride (serum)  18412  0.7575%
24                                Anion gap  18401  0.7570%
25                           Sodium (serum)  18401  0.7570%
26                      Calcium non-ionized  18373  0.7559%
27                         Prothrombin time  18068  0.7433%
28                                      LDH  17577  0.7231%
29                     TCO2 (calc) Arterial  17494  0.7197%
..                                      ...    ...      ...
548                              Resistance     65  0.0027%
549                  Impaired Skin Depth #9     65  0.0027%
550                              ABG SODIUM     61  0.0025%
551                            ABG CHLOIRDE     61  0.0025%
552                            BloodGlucose     61  0.0025%
553                      T. Protein (5-7.5)     61  0.0025%
554                        Pulsus Paradoxus     60  0.0025%
555                  Impaired Skin Depth #8     55  0.0023%
556           Skeletal Traction #1 - Pounds     55  0.0023%
557                   Temporary AV interval     53  0.0022%
558                       Tobramycin/Random     50  0.0021%
559   Intra Cranial Pressure #2 Alarm - Low     47  0.0019%
560                             Procan Napa     44  0.0018%
561                                  Procan     44  0.0018%
562        Dwell Time (Peritoneal Dialysis)     42  0.0017%
563                          Volume In (PD)     42  0.0017%
564                                 Ethanol     40  0.0016%
565                              Doppler BP     40  0.0016%
566                 Impaired Skin Depth #10     36  0.0015%
567                        Urine Leukocytes     26  0.0011%
568                                pH (Art)     26  0.0011%
569                           Urine Ketones     26  0.0011%
570                           ABG Potassium     18  0.0007%
571               Intra Cranial Pressure #2     18  0.0007%
572  Intra Cranial Pressure #2 Alarm - High     16  0.0007%
573                              SOFA Score     10  0.0004%
574                       ABI Brachial BP L      8  0.0003%
575                          ABI Ankle BP L      8  0.0003%
576                       ABI Brachial BP R      8  0.0003%
577                          ABI Ankle BP R      8  0.0003%

[578 rows x 3 columns]



In [13]:

    
patients_pn['label'].value_counts().plot(kind='bar')









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f998dfc88>

Ab_name indicates which antibiotic is sensitive the microorganism, this field together with the interpretation indicates if the microorganism the degree of resistance of this one to the antibiotic., the main antibiotics evaluated are gentamicin, trimethoprim/sulfa, levofloxacin, ceftazidime, tobramycin, cefepime, ciprofloxacin, meropenem, erythromycin, oxacillin, vancomycin, ceftriaxone, tetracycline, clindamycin, piperacillin/tazo, which represent 80% of the sample.



In [14]:

    
patients_pn['ab_name'].unique()









    Out[14]:





array(['TETRACYCLINE', 'RIFAMPIN', 'ERYTHROMYCIN', 'GENTAMICIN',
       'VANCOMYCIN', 'OXACILLIN', 'LEVOFLOXACIN', 'PENICILLIN',
       'AMPICILLIN', 'NITROFURANTOIN', 'LINEZOLID', 'CEFTAZIDIME',
       'TOBRAMYCIN', 'CEFAZOLIN', 'CEFTRIAXONE', 'CIPROFLOXACIN',
       'AMPICILLIN/SULBACTAM', 'PIPERACILLIN/TAZO', 'CEFEPIME',
       'MEROPENEM', 'TRIMETHOPRIM/SULFA', 'CEFUROXIME', 'AMIKACIN',
       'PIPERACILLIN', 'CHLORAMPHENICOL', 'CLINDAMYCIN', 'IMIPENEM',
       'DAPTOMYCIN', 'PENICILLIN G'], dtype=object)



In [15]:

    
patients_pn['ab_name'].value_counts()









    Out[15]:





GENTAMICIN              250109
TRIMETHOPRIM/SULFA      196366
LEVOFLOXACIN            140174
CEFTAZIDIME             136404
TOBRAMYCIN              132454
CEFEPIME                130684
CIPROFLOXACIN           130334
MEROPENEM               126236
ERYTHROMYCIN            119631
OXACILLIN               118439
VANCOMYCIN              106293
CEFTRIAXONE              95930
TETRACYCLINE             92353
CLINDAMYCIN              92223
PIPERACILLIN/TAZO        91957
RIFAMPIN                 73251
AMPICILLIN/SULBACTAM     72894
CEFAZOLIN                65303
AMPICILLIN               50295
PENICILLIN G             38486
PIPERACILLIN             31307
NITROFURANTOIN           28717
CEFUROXIME               28239
AMIKACIN                 23785
PENICILLIN               20394
IMIPENEM                 17084
LINEZOLID                14807
DAPTOMYCIN                5239
CHLORAMPHENICOL           1252
Name: ab_name, dtype: int64



In [16]:

    
patients_ab_name = patients_pn['ab_name'].value_counts().reset_index()
patients_ab_name.columns=['ab_name','Count']
patients_ab_name['Count'].apply('{:,.2f}'.format)
patients_ab_name['cum_perc'] = 100*patients_ab_name.Count/patients_ab_name.Count.sum()
patients_ab_name['cum_perc'] = patients_ab_name ['cum_perc'].map('{:,.4f}%'.format)

print (patients_ab_name)









    



                 ab_name   Count  cum_perc
0             GENTAMICIN  250109  10.2898%
1     TRIMETHOPRIM/SULFA  196366   8.0788%
2           LEVOFLOXACIN  140174   5.7670%
3            CEFTAZIDIME  136404   5.6119%
4             TOBRAMYCIN  132454   5.4493%
5               CEFEPIME  130684   5.3765%
6          CIPROFLOXACIN  130334   5.3621%
7              MEROPENEM  126236   5.1935%
8           ERYTHROMYCIN  119631   4.9218%
9              OXACILLIN  118439   4.8727%
10            VANCOMYCIN  106293   4.3730%
11           CEFTRIAXONE   95930   3.9467%
12          TETRACYCLINE   92353   3.7995%
13           CLINDAMYCIN   92223   3.7942%
14     PIPERACILLIN/TAZO   91957   3.7832%
15              RIFAMPIN   73251   3.0137%
16  AMPICILLIN/SULBACTAM   72894   2.9990%
17             CEFAZOLIN   65303   2.6867%
18            AMPICILLIN   50295   2.0692%
19          PENICILLIN G   38486   1.5834%
20          PIPERACILLIN   31307   1.2880%
21        NITROFURANTOIN   28717   1.1815%
22            CEFUROXIME   28239   1.1618%
23              AMIKACIN   23785   0.9785%
24            PENICILLIN   20394   0.8390%
25              IMIPENEM   17084   0.7029%
26             LINEZOLID   14807   0.6092%
27            DAPTOMYCIN    5239   0.2155%
28       CHLORAMPHENICOL    1252   0.0515%



In [17]:

    
patients_pn['ab_name'].value_counts().plot(kind='bar')









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x22fa8bf5160>

org_name has the microorganisms name are present in pneumonia patients. The main organims found in this patients are staph aureus coag +, klebsiella pneumoniae, escherichia coli, pseudomonas aeruginosa, staphylococcus, coagulase negative, klebsiella oxytoca, enterococcus sp., acinetobacter baumannii complex, serratia marcescens, enterobacter cloacae.



In [18]:

    
patients_pn['org_name'].unique()









    Out[18]:





array(['STAPH AUREUS COAG +', 'ENTEROCOCCUS SP.', 'KLEBSIELLA PNEUMONIAE',
       'ESCHERICHIA COLI', 'ENTEROCOCCUS FAECIUM',
       'STAPHYLOCOCCUS, COAGULASE NEGATIVE',
       'STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA',
       'ACINETOBACTER BAUMANNII', 'STREPTOCOCCUS PNEUMONIAE',
       'PSEUDOMONAS AERUGINOSA', 'ENTEROCOCCUS FAECALIS',
       'ENTEROBACTER CLOACAE', 'NON-FERMENTER, NOT PSEUDOMONAS AERUGINOSA',
       'KLEBSIELLA OXYTOCA', 'KLEBSIELLA SPECIES', 'PROTEUS MIRABILIS',
       'ENTEROBACTER AEROGENES', 'SERRATIA MARCESCENS',
       'CITROBACTER KOSERI', 'ACINETOBACTER BAUMANNII COMPLEX',
       'CITROBACTER FREUNDII COMPLEX', 'GRAM NEGATIVE ROD #2',
       'POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS',
       'BURKHOLDERIA (PSEUDOMONAS) CEPACIA', 'HAFNIA ALVEI',
       'MORGANELLA MORGANII', 'PROTEUS VULGARIS', 'CITROBACTER SPECIES',
       'STAPHYLOCOCCUS HOMINIS', 'ENTEROBACTER ASBURIAE',
       'STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP',
       'BURKHOLDERIA CEPACIA GROUP', 'PANTOEA SPECIES',
       'ENTEROBACTER GERGOVIAE', 'ENTEROBACTER SPECIES',
       'ENTEROBACTER CLOACAE COMPLEX', 'STAPHYLOCOCCUS EPIDERMIDIS',
       'CORYNEBACTERIUM SPECIES (DIPHTHEROIDS)', 'PROTEUS VULGARIS GROUP',
       'GRAM NEGATIVE ROD(S)', 'STAPHYLOCOCCUS LUGDUNENSIS',
       'PSEUDOMONAS PUTIDA/FLUORESCENS', 'AEROMONAS SPECIES',
       'VIRIDANS STREPTOCOCCI', 'ENTEROBACTER SAKAZAKII',
       'AEROMONAS HYDROPHILA',
       'HAEMOPHILUS INFLUENZAE, BETA-LACTAMASE NEGATIVE',
       'PSEUDOMONAS FLUORESCENS', 'CHRYSEOBACTERIUM INDOLOGENES',
       'BETA STREPTOCOCCUS GROUP B', 'LACTOBACILLUS SPECIES',
       'PROVIDENCIA STUARTII', 'SERRATIA LIQUEFACIENS',
       'ALCALIGENES (ACHROMOBACTER) SPECIES'], dtype=object)



In [19]:

    
patients_pn['org_name'].value_counts()









    Out[19]:





STAPH AUREUS COAG +                                758120
KLEBSIELLA PNEUMONIAE                              369507
ESCHERICHIA COLI                                   304936
PSEUDOMONAS AERUGINOSA                             251556
STAPHYLOCOCCUS, COAGULASE NEGATIVE                 123864
KLEBSIELLA OXYTOCA                                  78064
ENTEROCOCCUS SP.                                    66151
ACINETOBACTER BAUMANNII COMPLEX                     59623
SERRATIA MARCESCENS                                 59299
ENTEROBACTER CLOACAE                                56678
PROTEUS MIRABILIS                                   38983
ENTEROBACTER AEROGENES                              38206
STREPTOCOCCUS PNEUMONIAE                            37870
NON-FERMENTER, NOT PSEUDOMONAS AERUGINOSA           28490
ENTEROCOCCUS FAECIUM                                26675
STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA          16186
POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS     15453
CITROBACTER FREUNDII COMPLEX                        12068
STAPHYLOCOCCUS EPIDERMIDIS                          10355
CITROBACTER KOSERI                                   9467
ENTEROBACTER CLOACAE COMPLEX                         7277
ACINETOBACTER BAUMANNII                              5148
HAFNIA ALVEI                                         4130
GRAM NEGATIVE ROD(S)                                 3769
ENTEROCOCCUS FAECALIS                                3576
AEROMONAS SPECIES                                    3484
PROTEUS VULGARIS                                     3438
PSEUDOMONAS FLUORESCENS                              3340
MORGANELLA MORGANII                                  3227
STAPHYLOCOCCUS HOMINIS                               3000
STAPHYLOCOCCUS LUGDUNENSIS                           2310
BURKHOLDERIA CEPACIA GROUP                           2308
ALCALIGENES (ACHROMOBACTER) SPECIES                  1976
PSEUDOMONAS PUTIDA/FLUORESCENS                       1848
GRAM NEGATIVE ROD #2                                 1848
ENTEROBACTER SPECIES                                 1760
ENTEROBACTER SAKAZAKII                               1575
AEROMONAS HYDROPHILA                                 1443
CHRYSEOBACTERIUM INDOLOGENES                         1368
BURKHOLDERIA (PSEUDOMONAS) CEPACIA                   1312
PROVIDENCIA STUARTII                                 1309
CITROBACTER SPECIES                                  1236
PANTOEA SPECIES                                      1216
ENTEROBACTER ASBURIAE                                1200
PROTEUS VULGARIS GROUP                               1099
ENTEROBACTER GERGOVIAE                               1035
SERRATIA LIQUEFACIENS                                 776
STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP               624
CORYNEBACTERIUM SPECIES (DIPHTHEROIDS)                592
KLEBSIELLA SPECIES                                    576
VIRIDANS STREPTOCOCCI                                 508
LACTOBACILLUS SPECIES                                 399
BETA STREPTOCOCCUS GROUP B                            278
HAEMOPHILUS INFLUENZAE, BETA-LACTAMASE NEGATIVE       104
Name: org_name, dtype: int64



In [20]:

    
patients_pn['org_name'].value_counts().plot(kind='bar')









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x22ffb1204a8>



In [21]:

    
patients_org_name = patients_pn['org_name'].value_counts().reset_index()
patients_org_name.columns=['org_name','Count']
patients_org_name['Count'].apply('{:,.2f}'.format)
patients_org_name['cum_perc'] = 100*patients_org_name.Count/patients_org_name.Count.sum()
patients_org_name['cum_perc'] = patients_org_name['cum_perc'].map('{:,.4f}%'.format)

print (patients_org_name)









    



                                           org_name   Count  cum_perc
0                               STAPH AUREUS COAG +  758120  31.1901%
1                             KLEBSIELLA PNEUMONIAE  369507  15.2020%
2                                  ESCHERICHIA COLI  304936  12.5455%
3                            PSEUDOMONAS AERUGINOSA  251556  10.3494%
4                STAPHYLOCOCCUS, COAGULASE NEGATIVE  123864   5.0959%
5                                KLEBSIELLA OXYTOCA   78064   3.2117%
6                                  ENTEROCOCCUS SP.   66151   2.7215%
7                   ACINETOBACTER BAUMANNII COMPLEX   59623   2.4530%
8                               SERRATIA MARCESCENS   59299   2.4396%
9                              ENTEROBACTER CLOACAE   56678   2.3318%
10                                PROTEUS MIRABILIS   38983   1.6038%
11                           ENTEROBACTER AEROGENES   38206   1.5718%
12                         STREPTOCOCCUS PNEUMONIAE   37870   1.5580%
13        NON-FERMENTER, NOT PSEUDOMONAS AERUGINOSA   28490   1.1721%
14                             ENTEROCOCCUS FAECIUM   26675   1.0974%
15       STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA   16186   0.6659%
16  POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS   15453   0.6358%
17                     CITROBACTER FREUNDII COMPLEX   12068   0.4965%
18                       STAPHYLOCOCCUS EPIDERMIDIS   10355   0.4260%
19                               CITROBACTER KOSERI    9467   0.3895%
20                     ENTEROBACTER CLOACAE COMPLEX    7277   0.2994%
21                          ACINETOBACTER BAUMANNII    5148   0.2118%
22                                     HAFNIA ALVEI    4130   0.1699%
23                             GRAM NEGATIVE ROD(S)    3769   0.1551%
24                            ENTEROCOCCUS FAECALIS    3576   0.1471%
25                                AEROMONAS SPECIES    3484   0.1433%
26                                 PROTEUS VULGARIS    3438   0.1414%
27                          PSEUDOMONAS FLUORESCENS    3340   0.1374%
28                              MORGANELLA MORGANII    3227   0.1328%
29                           STAPHYLOCOCCUS HOMINIS    3000   0.1234%
30                       STAPHYLOCOCCUS LUGDUNENSIS    2310   0.0950%
31                       BURKHOLDERIA CEPACIA GROUP    2308   0.0950%
32              ALCALIGENES (ACHROMOBACTER) SPECIES    1976   0.0813%
33                   PSEUDOMONAS PUTIDA/FLUORESCENS    1848   0.0760%
34                             GRAM NEGATIVE ROD #2    1848   0.0760%
35                             ENTEROBACTER SPECIES    1760   0.0724%
36                           ENTEROBACTER SAKAZAKII    1575   0.0648%
37                             AEROMONAS HYDROPHILA    1443   0.0594%
38                     CHRYSEOBACTERIUM INDOLOGENES    1368   0.0563%
39               BURKHOLDERIA (PSEUDOMONAS) CEPACIA    1312   0.0540%
40                             PROVIDENCIA STUARTII    1309   0.0539%
41                              CITROBACTER SPECIES    1236   0.0509%
42                                  PANTOEA SPECIES    1216   0.0500%
43                            ENTEROBACTER ASBURIAE    1200   0.0494%
44                           PROTEUS VULGARIS GROUP    1099   0.0452%
45                           ENTEROBACTER GERGOVIAE    1035   0.0426%
46                            SERRATIA LIQUEFACIENS     776   0.0319%
47          STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP     624   0.0257%
48           CORYNEBACTERIUM SPECIES (DIPHTHEROIDS)     592   0.0244%
49                               KLEBSIELLA SPECIES     576   0.0237%
50                            VIRIDANS STREPTOCOCCI     508   0.0209%
51                            LACTOBACILLUS SPECIES     399   0.0164%
52                       BETA STREPTOCOCCUS GROUP B     278   0.0114%
53  HAEMOPHILUS INFLUENZAE, BETA-LACTAMASE NEGATIVE     104   0.0043%

interpretation indicates the results of the test, “S” when the antibiotic is sensitive, “R” when is resistant, “I” when the antibiotic is intermediate, and “P” when is pending.



In [22]:

    
patients_pn['interpretation'].unique()









    Out[22]:





array(['S', 'R', 'I', 'P'], dtype=object)



In [23]:

    
patients_pn['interpretation'].value_counts()









    Out[23]:





S    1617210
R     732254
I      81132
P         44
Name: interpretation, dtype: int64



In [24]:

    
patients_interpretation = patients_pn['interpretation'].value_counts().reset_index()
patients_interpretation.columns=['interpretation','Count']
patients_interpretation['Count'].apply('{:,.2f}'.format)
patients_interpretation['cum_perc'] = 100*patients_interpretation.Count/patients_interpretation.Count.sum()
patients_interpretation['cum_perc'] = patients_interpretation ['cum_perc'].map('{:,.4f}%'.format)

print (patients_interpretation)









    



  interpretation    Count  cum_perc
0              S  1617210  66.5343%
1              R   732254  30.1260%
2              I    81132   3.3379%
3              P       44   0.0018%



In [25]:

    
patients_pn['interpretation'].value_counts().plot(kind='bar')









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x22ff6f395c0>



In [26]:

    
patients_pn.head()









    Out[26]:






  
    
      
      hospital_expire_flag
      subject_id
      gender
      last_admit_age
      age_group
      category
      label
      valuenum_avg
      org_name
      ab_name
      interpretation
    
  
  
    
      0
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      TETRACYCLINE
      S
    
    
      1
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      RIFAMPIN
      S
    
    
      2
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      ERYTHROMYCIN
      R
    
    
      3
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      GENTAMICIN
      S
    
    
      4
      0
      157
      M
      80.54
      elderly
      ABG
      Arterial Base Excess
      -3.75
      STAPH AUREUS COAG +
      VANCOMYCIN
      S

To transform the matrix in a pivot table, the first step is transform some categorical variables as dummy variables. The chosen variables are gender, age_group, category, label, org_name, ab_name, and interpretation. This operation was done with pandas get_dummies command. The result of this transformation is a panda data frame with shape 2,430,640 rows by 716 columns, these new columns are binaries variables and only take the number 1 once the categorical effect happened.



In [27]:

    
patients_dummy = pd.get_dummies(patients_pn,prefix=['gender', 'age_group', 'category','label', 
                                                    'org_name','ab_name', 'interpretation'])
patients_dummy.head()









    Out[27]:






  
    
      
      hospital_expire_flag
      subject_id
      last_admit_age
      valuenum_avg
      gender_F
      gender_M
      age_group_adult
      age_group_elderly
      age_group_neonate
      age_group_oldest old
      ...
      ab_name_PIPERACILLIN/TAZO
      ab_name_RIFAMPIN
      ab_name_TETRACYCLINE
      ab_name_TOBRAMYCIN
      ab_name_TRIMETHOPRIM/SULFA
      ab_name_VANCOMYCIN
      interpretation_I
      interpretation_P
      interpretation_R
      interpretation_S
    
  
  
    
      0
      0
      157
      80.54
      -3.75
      0
      1
      0
      1
      0
      0
      ...
      0
      0
      1
      0
      0
      0
      0
      0
      0
      1
    
    
      1
      0
      157
      80.54
      -3.75
      0
      1
      0
      1
      0
      0
      ...
      0
      1
      0
      0
      0
      0
      0
      0
      0
      1
    
    
      2
      0
      157
      80.54
      -3.75
      0
      1
      0
      1
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
    
    
      3
      0
      157
      80.54
      -3.75
      0
      1
      0
      1
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
    
    
      4
      0
      157
      80.54
      -3.75
      0
      1
      0
      1
      0
      0
      ...
      0
      0
      0
      0
      0
      1
      0
      0
      0
      1
    
  

5 rows × 716 columns

The next step, is to transform the matrix into a PivotTable, the purpose of this transformation is to be able to have the medical data in individual lines per subject and numerical form.
To do that, I employed pandas pivot_table command, and using as indexes subject_id and hospital_expire_flag. With this transformation, the resulting panda data frame has 829 rows by 724 columns. The data transformed in this form allow apply the classifier models to this data.



In [28]:

    
patients_data = pd.pivot_table(patients_dummy,index=["subject_id", "hospital_expire_flag" ])
process = time.process_time()
print (process)



In [29]:

    
patients_data.head()









    Out[29]:






  
    
      
      
      ab_name_AMIKACIN
      ab_name_AMPICILLIN
      ab_name_AMPICILLIN/SULBACTAM
      ab_name_CEFAZOLIN
      ab_name_CEFEPIME
      ab_name_CEFTAZIDIME
      ab_name_CEFTRIAXONE
      ab_name_CEFUROXIME
      ab_name_CHLORAMPHENICOL
      ab_name_CIPROFLOXACIN
      ...
      org_name_STAPH AUREUS COAG +
      org_name_STAPHYLOCOCCUS EPIDERMIDIS
      org_name_STAPHYLOCOCCUS HOMINIS
      org_name_STAPHYLOCOCCUS LUGDUNENSIS
      org_name_STAPHYLOCOCCUS, COAGULASE NEGATIVE
      org_name_STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA
      org_name_STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP
      org_name_STREPTOCOCCUS PNEUMONIAE
      org_name_VIRIDANS STREPTOCOCCI
      valuenum_avg
    
    
      subject_id
      hospital_expire_flag
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      41
      0
      0.0
      0.018182
      0.036364
      0.036364
      0.036364
      0.036364
      0.036364
      0.036364
      0.0
      0.018182
      ...
      0.236364
      0.0
      0.0
      0
      0.163636
      0.072727
      0.0
      0.0
      0
      36.219025
    
    
      157
      0
      0.0
      0.076923
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.000000
      ...
      0.615385
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      44.302839
    
    
      177
      0
      0.0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      39.911055
    
    
      203
      0
      0.0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      53.690473
    
    
      236
      0
      0.0
      0.000000
      0.090909
      0.090909
      0.090909
      0.090909
      0.090909
      0.000000
      0.0
      0.090909
      ...
      0.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      69.293584
    
  

5 rows × 714 columns



In [30]:

    
patients_data.info()









    



<class 'pandas.core.frame.DataFrame'>
MultiIndex: 829 entries, (41, 0) to (99985, 0)
Columns: 714 entries, ab_name_AMIKACIN to valuenum_avg
dtypes: float64(703), uint8(11)
memory usage: 4.5+ MB



In [31]:

    
patients = patients_data.reset_index()



In [32]:

    
patients.head()









    Out[32]:






  
    
      
      subject_id
      hospital_expire_flag
      ab_name_AMIKACIN
      ab_name_AMPICILLIN
      ab_name_AMPICILLIN/SULBACTAM
      ab_name_CEFAZOLIN
      ab_name_CEFEPIME
      ab_name_CEFTAZIDIME
      ab_name_CEFTRIAXONE
      ab_name_CEFUROXIME
      ...
      org_name_STAPH AUREUS COAG +
      org_name_STAPHYLOCOCCUS EPIDERMIDIS
      org_name_STAPHYLOCOCCUS HOMINIS
      org_name_STAPHYLOCOCCUS LUGDUNENSIS
      org_name_STAPHYLOCOCCUS, COAGULASE NEGATIVE
      org_name_STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA
      org_name_STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP
      org_name_STREPTOCOCCUS PNEUMONIAE
      org_name_VIRIDANS STREPTOCOCCI
      valuenum_avg
    
  
  
    
      0
      41
      0
      0.0
      0.018182
      0.036364
      0.036364
      0.036364
      0.036364
      0.036364
      0.036364
      ...
      0.236364
      0.0
      0.0
      0
      0.163636
      0.072727
      0.0
      0.0
      0
      36.219025
    
    
      1
      157
      0
      0.0
      0.076923
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      ...
      0.615385
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      44.302839
    
    
      2
      177
      0
      0.0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      39.911055
    
    
      3
      203
      0
      0.0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      53.690473
    
    
      4
      236
      0
      0.0
      0.000000
      0.090909
      0.090909
      0.090909
      0.090909
      0.090909
      0.000000
      ...
      0.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      69.293584
    
  

5 rows × 716 columns



In [33]:

    
p_data= patients.ix[:,2:]



In [34]:

    
p_data.head()









    Out[34]:






  
    
      
      ab_name_AMIKACIN
      ab_name_AMPICILLIN
      ab_name_AMPICILLIN/SULBACTAM
      ab_name_CEFAZOLIN
      ab_name_CEFEPIME
      ab_name_CEFTAZIDIME
      ab_name_CEFTRIAXONE
      ab_name_CEFUROXIME
      ab_name_CHLORAMPHENICOL
      ab_name_CIPROFLOXACIN
      ...
      org_name_STAPH AUREUS COAG +
      org_name_STAPHYLOCOCCUS EPIDERMIDIS
      org_name_STAPHYLOCOCCUS HOMINIS
      org_name_STAPHYLOCOCCUS LUGDUNENSIS
      org_name_STAPHYLOCOCCUS, COAGULASE NEGATIVE
      org_name_STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA
      org_name_STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP
      org_name_STREPTOCOCCUS PNEUMONIAE
      org_name_VIRIDANS STREPTOCOCCI
      valuenum_avg
    
  
  
    
      0
      0.0
      0.018182
      0.036364
      0.036364
      0.036364
      0.036364
      0.036364
      0.036364
      0.0
      0.018182
      ...
      0.236364
      0.0
      0.0
      0
      0.163636
      0.072727
      0.0
      0.0
      0
      36.219025
    
    
      1
      0.0
      0.076923
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.000000
      ...
      0.615385
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      44.302839
    
    
      2
      0.0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      39.911055
    
    
      3
      0.0
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      53.690473
    
    
      4
      0.0
      0.000000
      0.090909
      0.090909
      0.090909
      0.090909
      0.090909
      0.000000
      0.0
      0.090909
      ...
      0.000000
      0.0
      0.0
      0
      0.000000
      0.000000
      0.0
      0.0
      0
      69.293584
    
  

5 rows × 714 columns



In [35]:

    
p_target= patients['hospital_expire_flag']



In [36]:

    
p_target.head()









    Out[36]:





0    0
1    0
2    0
3    0
4    0
Name: hospital_expire_flag, dtype: int64

In all models, the variable dependent is survival state (Alive / Deceased). In order to sub-setting the data I work with a test size of 25% of the sample, I chose this value after some essays, a higher percentage could lower the computer velocity, and a higher value could make the results will be spurious or meaningless.



In [37]:

    
X_train, X_test, y_train, y_test = train_test_split(
    p_data, p_target, test_size=0.25, random_state=123)



In [38]:

    
X_train.head()









    Out[38]:






  
    
      
      ab_name_AMIKACIN
      ab_name_AMPICILLIN
      ab_name_AMPICILLIN/SULBACTAM
      ab_name_CEFAZOLIN
      ab_name_CEFEPIME
      ab_name_CEFTAZIDIME
      ab_name_CEFTRIAXONE
      ab_name_CEFUROXIME
      ab_name_CHLORAMPHENICOL
      ab_name_CIPROFLOXACIN
      ...
      org_name_STAPH AUREUS COAG +
      org_name_STAPHYLOCOCCUS EPIDERMIDIS
      org_name_STAPHYLOCOCCUS HOMINIS
      org_name_STAPHYLOCOCCUS LUGDUNENSIS
      org_name_STAPHYLOCOCCUS, COAGULASE NEGATIVE
      org_name_STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA
      org_name_STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP
      org_name_STREPTOCOCCUS PNEUMONIAE
      org_name_VIRIDANS STREPTOCOCCI
      valuenum_avg
    
  
  
    
      121
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.0
      0.0
      0.000000
      ...
      1.000000
      0.0
      0.0
      0
      0.000000
      0.0
      0.0
      0.000000
      0
      42.223256
    
    
      498
      0.000000
      0.022222
      0.022222
      0.022222
      0.088889
      0.088889
      0.088889
      0.0
      0.0
      0.088889
      ...
      0.155556
      0.0
      0.0
      0
      0.000000
      0.0
      0.0
      0.044444
      0
      233.596849
    
    
      161
      0.000000
      0.035714
      0.035714
      0.035714
      0.035714
      0.035714
      0.035714
      0.0
      0.0
      0.035714
      ...
      0.535714
      0.0
      0.0
      0
      0.000000
      0.0
      0.0
      0.000000
      0
      320.200094
    
    
      512
      0.076923
      0.076923
      0.076923
      0.076923
      0.076923
      0.076923
      0.076923
      0.0
      0.0
      0.076923
      ...
      0.000000
      0.0
      0.0
      0
      0.000000
      0.0
      0.0
      0.000000
      0
      62.870238
    
    
      209
      0.000000
      0.000000
      0.000000
      0.000000
      0.098592
      0.098592
      0.000000
      0.0
      0.0
      0.098592
      ...
      0.084507
      0.0
      0.0
      0
      0.126761
      0.0
      0.0
      0.000000
      0
      110.258445
    
  

5 rows × 714 columns



In [39]:

    
X_train.shape, y_train.shape









    Out[39]:





((621, 714), (621,))



In [40]:

    
X_test.shape, y_test.shape









    Out[40]:





((208, 714), (208,))

Support Vector Machine: Machine Support Vector (SVM) is a classification method that separates a sample points in different hyperplanes in multidimensional spaces, which are separated by different labels. The algorithm seeks to group the data (classify) by optimal search (minimum distance) between the hyperplanes, the resulting vectors are called support vectors . The optimization is made over kernels (mathematical functions), in this analysis I used different methods: linear, radial basis function (rbf) and sigmoid. I purposely avoided using the polynomial kernels, more because of parameterization problems that did not allow me to run the data with this algorithm.



In [41]:

    
# the same model as above, only has changed the jobs

clf_SVC = SVC(kernel='linear', C=1)
scores_SVC = cross_val_score(clf_SVC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_SVC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_SVC.mean(), scores_SVC.std() * 2))
clf_SVC = clf_SVC.fit(X_train, y_train)
y_predicted_SVC = clf_SVC.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_SVC))

process = time.process_time()
print (process)









    



[ 0.72435897  0.71612903  0.70322581  0.72258065]
Accuracy: 0.7166 (+/- 0.0166)
             precision    recall  f1-score   support

          0       0.74      1.00      0.85       154
          1       0.00      0.00      0.00        54

avg / total       0.55      0.74      0.63       208

308.734375






    



C:\Users\pc1\Anaconda3\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [42]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_SVC), annot = True, fmt = '', cmap = "GnBu")









    Out[42]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f8aaa4278>



In [43]:

    
print ("Fitting the Support Vector Classification - kernel Radial Basis Function classifier to the training set")

param_grid = {
          'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5],
          'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_RBF = GridSearchCV(SVC(kernel='rbf', class_weight='balanced', cache_size=1000), param_grid)
clf_RBF = clf_RBF.fit(X_train, y_train)

scores_RBF = cross_val_score(clf_RBF, X_train, y_train, cv=4, n_jobs=-1)
print(scores_RBF)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_RBF.mean(), scores_RBF.std() * 2))
print ("Best estimator found by grid search:")
print (clf_RBF.best_estimator_)

y_predicted_RBF = clf_RBF.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_RBF))

process = time.process_time()
print (process)









    



Fitting the Support Vector Classification - kernel Radial Basis Function classifier to the training set
[ 0.71794872  0.72258065  0.72258065  0.72258065]
Accuracy: 0.72 (+/- 0.00)
Best estimator found by grid search:
SVC(C=1, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

          0       0.75      0.99      0.85       154
          1       0.50      0.04      0.07        54

avg / total       0.68      0.74      0.65       208

405.6875



In [44]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_RBF), annot = True, fmt = '', cmap = "GnBu")









    Out[44]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f8b9bcc88>



In [45]:

    
# Mimic-iii_Model-Pulmonary.ipynb
print ("Fitting the Linear Support Vector Classification - Hingue loss classifier to the training set")

param_grid = {
         'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5],
          #'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_LSCV = GridSearchCV(LinearSVC(C=1, loss= 'hinge'), param_grid, n_jobs=-1)
clf_LSCV = clf_LSCV.fit(X_train, y_train)

scores_LSCV = cross_val_score(clf_LSCV, X_train, y_train, cv=4, n_jobs=-1)
print(scores_LSCV)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_LSCV.mean(), scores_LSCV.std() * 2))

print ("Best estimator found by grid search:")
print (clf_LSCV.best_estimator_)

y_predicted_LSCV = clf_LSCV.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_LSCV))

process = time.process_time()
print (process)









    



Fitting the Linear Support Vector Classification - Hingue loss classifier to the training set
[ 0.72435897  0.72258065  0.72258065  0.72258065]
Accuracy: 0.7230 (+/- 0.0015)
Best estimator found by grid search:
LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)
             precision    recall  f1-score   support

          0       0.74      1.00      0.85       154
          1       0.00      0.00      0.00        54

avg / total       0.55      0.74      0.63       208

406.53125






    



C:\Users\pc1\Anaconda3\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [46]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_LSCV), annot = True, fmt = '', cmap = "GnBu")









    Out[46]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f8c141320>

I was struggling a lot with this model, for that reason I will not use this for capstone project

print ("Fitting the classifier to the training set") from sklearn.tree import DecisionTreeClassifier param_grid = { 'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'degree': [3,4,5] }

for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'

clf_poly = GridSearchCV(SVC(kernel='poly', class_weight='balanced'), param_grid, n_jobs=-1) clf_poly = clf_poly.fit(X_train, y_train)

scores_poly = cross_val_score(clf_poly, X_train, y_train, cv=4,n_jobs=-1) print(scores_poly) print("Accuracy: %0.4f (+/- %0.4f)" % (scores_poly.mean(), scores_poly.std() * 2))

print ("Best estimator found by grid search:") print (clf_poly.bestestimator)

y_predicted_poly = clf_poly.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_poly))

process = time.process_time() print (process)



In [47]:

    
print ("Fitting the Support Vector Classification - kernel Sigmoid classifier to the training set")

param_grid = {
         'C': [1e3, 1e4, 1e5],
         'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
         'coef0':[-1,0,1]
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_SIGMOID = GridSearchCV(SVC(kernel='sigmoid', class_weight='balanced'), param_grid, n_jobs=-1)
clf_SIGMOID = clf_SIGMOID.fit(X_train, y_train)

scores_SIGMOID = cross_val_score(clf_SIGMOID, X_train, y_train, cv=4, n_jobs=-1)
print(scores_SIGMOID)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_SIGMOID.mean(), scores_SIGMOID.std() * 2))

print ("Best estimator found by grid search:")
print (clf_SIGMOID.best_estimator_)

y_predicted_SIGMOID = clf_SIGMOID.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_SIGMOID))

process = time.process_time()
print (process)









    



Fitting the Support Vector Classification - kernel Sigmoid classifier to the training set
[ 0.62179487  0.67096774  0.57419355  0.6       ]
Accuracy: 0.6167 (+/- 0.0711)
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=1,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

          0       0.72      0.68      0.70       154
          1       0.22      0.26      0.24        54

avg / total       0.59      0.57      0.58       208

409.609375



In [48]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_SIGMOID), annot = True, fmt = '', cmap = "GnBu")









    Out[48]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f9003b400>

Decision Tree Classifier is a no parametric method that learns through binary decisions that when deployed are forming a decision tree.



In [49]:

    
print ("Fitting the Decision Tree Classifier to the training set")

param_grid = {
         'max_depth': [2, 3, 4, 5, 6,7],
          #'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_DTC = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=123, 
                       class_weight='balanced'), param_grid, n_jobs=-1)

clf_DTC = clf_DTC.fit(X_train, y_train)

scores_DTC = cross_val_score(clf_DTC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_DTC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_DTC.mean(), scores_DTC.std() * 2))

print ("Best estimator found by grid search:")
print (clf_DTC.best_estimator_)
y_predicted_DTC = clf_DTC.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_DTC))
process = time.process_time()
print (process)









    



Fitting the Decision Tree Classifier to the training set
[ 0.60897436  0.6516129   0.61935484  0.70322581]
Accuracy: 0.6458 (+/- 0.0734)
Best estimator found by grid search:
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=7, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best')
             precision    recall  f1-score   support

          0       0.84      0.64      0.73       154
          1       0.39      0.65      0.49        54

avg / total       0.72      0.64      0.67       208

410.328125



In [50]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_DTC), annot = True, fmt = '', cmap = "GnBu")









    Out[50]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f94cb1be0>

Ensemble methods like Random Forest, Extremely Tree and Ada Boost Classifiers. These methods “combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator ”. The first two are in averaging methods, where independent estimators are used over random samples and resulting predictions are averaged, getting as result a lower variance than a single estimator.



In [51]:

    
print ("Fitting the Random Forest Classifier to the training set")

param_grid = {
         'n_estimators' :[3,5,7,10],
         'max_depth': [2, 3, 4, 5, 6,7],
          #'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_RFC = GridSearchCV(RandomForestClassifier(min_samples_split=2, random_state=123, class_weight='balanced'), 
                       param_grid, n_jobs=-1)

clf_RFC = clf_RFC.fit(X_train, y_train)

scores_RFC = cross_val_score(clf_RFC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_RFC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_RFC.mean(), scores_RFC.std() * 2))

print ("Best estimator found by grid search:")
print (clf_RFC.best_estimator_)
y_predicted_RFC = clf_RFC.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_RFC))
process = time.process_time()
print (process)









    



Fitting the Random Forest Classifier to the training set
[ 0.69230769  0.67096774  0.66451613  0.61935484]
Accuracy: 0.6618 (+/- 0.0531)
Best estimator found by grid search:
RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=7, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False)
             precision    recall  f1-score   support

          0       0.79      0.75      0.77       154
          1       0.38      0.44      0.41        54

avg / total       0.69      0.67      0.68       208

411.296875



In [52]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_RFC), annot = True, fmt = '', cmap = "GnBu")









    Out[52]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f9761c898>



In [53]:

    
print ("Fitting the Extremely Tree Classifier to the training set")

param_grid = {
         'n_estimators' :[3,5,10],
         'max_depth': [2, 3, 4, 5, 6,7],
          #'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_EFC = GridSearchCV(ExtraTreesClassifier(min_samples_split=2, random_state=123, class_weight='balanced'), 
                       param_grid, n_jobs=-1)

clf_EFC = clf_EFC.fit(X_train, y_train)

scores_EFC = cross_val_score(clf_EFC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_EFC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_EFC.mean(), scores_EFC.std() * 2))

print ("Best estimator found by grid search:")
print (clf_EFC.best_estimator_)
y_predicted_EFC = clf_EFC.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_EFC))
process = time.process_time()
print (process)









    



Fitting the Extremely Tree Classifier to the training set
[ 0.72435897  0.67741935  0.63870968  0.64516129]
Accuracy: 0.6714 (+/- 0.0678)
Best estimator found by grid search:
ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=5, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False)
             precision    recall  f1-score   support

          0       0.83      0.81      0.82       154
          1       0.48      0.52      0.50        54

avg / total       0.74      0.73      0.73       208

412.1875



In [54]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_EFC), annot = True, fmt = '', cmap = "GnBu")









    Out[54]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f97db2518>



In [55]:

    
print ("Fitting the Ada Boost Classifier to the training set")

param_grid = {
         'n_estimators' :[3,5,10],
         'learning_rate': [0.01],
         #'max_depth': [2, 3, 4, 5, 6,7],
          #'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_ABC = GridSearchCV(AdaBoostClassifier(random_state=123), param_grid, n_jobs=-1)

clf_ABC = clf_ABC.fit(X_train, y_train)

scores_ABC = cross_val_score(clf_ABC, X_train, y_train, cv=4, n_jobs=-1)
print(scores_ABC)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_ABC.mean(), scores_ABC.std() * 2))

print ("Best estimator found by grid search:")
print (clf_ABC.best_estimator_)
y_predicted_ABC = clf_ABC.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_ABC))
process = time.process_time()
print (process)









    



Fitting the Ada Boost Classifier to the training set
[ 0.72435897  0.72258065  0.72258065  0.72258065]
Accuracy: 0.7230 (+/- 0.0015)
Best estimator found by grid search:
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.01, n_estimators=3, random_state=123)
             precision    recall  f1-score   support

          0       0.74      1.00      0.85       154
          1       0.00      0.00      0.00        54

avg / total       0.55      0.74      0.63       208

412.765625






    



C:\Users\pc1\Anaconda3\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [56]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_ABC), annot = True, fmt = '', cmap = "GnBu")









    Out[56]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f9b0cc4a8>

Logistic Regression Classifier is the most traditional method applied to classification problems. Here a logistic probability function is applied to data, and the result obtained is a probability of occurrence of the binary categorical variable



In [57]:

    
print ("Fitting the Logistic Regression Classification - Hingue loss classifier to the training set")

param_grid = {
         'C': [1e-3, 1e-2, 1, 1e3, 5e3, 1e4, 5e4, 1e5],
          #'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
          }
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf_LOGREG = GridSearchCV(LogisticRegression(random_state=123), param_grid, n_jobs=-1)
clf_LOGREG= clf_LOGREG .fit(X_train, y_train)

scores_LOGREG = cross_val_score(clf_LOGREG, X_train, y_train, cv=4, n_jobs=-1)
print(scores_LOGREG)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_LOGREG.mean(), scores_LOGREG.std() * 2))

print ("Best estimator found by grid search:")
print (clf_LOGREG.best_estimator_)

y_predicted_LOGREG = clf_LOGREG.predict(X_test)

print (metrics.classification_report(y_test, y_predicted_LOGREG))

process = time.process_time()
print (process)









    



Fitting the Logistic Regression Classification - Hingue loss classifier to the training set
[ 0.72435897  0.72258065  0.72258065  0.72258065]
Accuracy: 0.7230 (+/- 0.0015)
Best estimator found by grid search:
LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

          0       0.74      1.00      0.85       154
          1       0.00      0.00      0.00        54

avg / total       0.55      0.74      0.63       208

413.46875






    



C:\Users\pc1\Anaconda3\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [58]:

    
sns.heatmap(confusion_matrix(y_test, y_predicted_LOGREG), annot = True, fmt = '', cmap = "GnBu")









    Out[58]:





<matplotlib.axes._subplots.AxesSubplot at 0x22f9b68d198>



In [59]:

    
# Best Models

clf_RBF_b = SVC(C=1, cache_size=200, class_weight='balanced', coef0=0.0,
                decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
                max_iter=-1, probability=False, random_state=None, shrinking=True,
                tol=0.001, verbose=False)

y_predicted_RBF_b = clf_RBF_b.fit(X_train,y_train).predict(X_test)

clf_LSCV_b = LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True,
                       intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
                       penalty='l2', random_state=None, tol=0.0001, verbose=0)

y_predicted_LSCV_b = clf_LSCV_b.fit(X_train,y_train).predict(X_test)

clf_SIGMOID_b = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=1,
                    decision_function_shape=None, degree=3, gamma=0.001, kernel='sigmoid',
                    max_iter=-1, probability=False, random_state=None, shrinking=True,
                    tol=0.001, verbose=False)

y_predicted_SIGMOID_b = clf_SIGMOID_b.fit(X_train,y_train).predict(X_test)

clf_DTC_b = DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                                   max_depth=7, max_features=None, max_leaf_nodes=None,
                                   min_impurity_split=1e-07, min_samples_leaf=1,
                                   min_samples_split=2, min_weight_fraction_leaf=0.0,
                                   presort=False, random_state=123, splitter='best')

y_predicted_DTC_b = clf_DTC_b.fit(X_train,y_train).predict(X_test)

clf_RFC_b = RandomForestClassifier(bootstrap=True, class_weight='balanced',
                                   criterion='gini', max_depth=7, max_features='auto',
                                   max_leaf_nodes=None, min_impurity_split=1e-07,
                                   min_samples_leaf=1, min_samples_split=2,
                                   min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                                   oob_score=False, random_state=123, verbose=0, warm_start=False)

y_predicted_RFC_b = clf_RFC_b.fit(X_train,y_train).predict(X_test)

clf_EFC_b = ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
                                 criterion='gini', max_depth=5, max_features='auto',
                                 max_leaf_nodes=None, min_impurity_split=1e-07,
                                 min_samples_leaf=1, min_samples_split=2,
                                 min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                                 oob_score=False, random_state=123, verbose=0, warm_start=False)

y_predicted_EFC_b = clf_EFC_b.fit(X_train,y_train).predict(X_test)

clf_ABC_b = AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
                               learning_rate=0.01, n_estimators=3, random_state=123)

y_predicted_ABC_b = clf_ABC_b.fit(X_train,y_train).predict(X_test)


clf_LR_b= LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
                             intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
                             penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
                             verbose=0, warm_start=False)

y_predicted_LR_b = clf_LR_b.fit(X_train,y_train).predict(X_test)



In [60]:

    
fig, axes = plt.subplots(2,4)

sns.heatmap(confusion_matrix(y_test, y_predicted_RBF_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 0])
sns.heatmap(confusion_matrix(y_test, y_predicted_LSCV_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 1])
sns.heatmap(confusion_matrix(y_test, y_predicted_SIGMOID_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 2])
sns.heatmap(confusion_matrix(y_test, y_predicted_DTC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[0, 3])
sns.heatmap(confusion_matrix(y_test, y_predicted_RFC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 0])
sns.heatmap(confusion_matrix(y_test, y_predicted_EFC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 1])
sns.heatmap(confusion_matrix(y_test, y_predicted_ABC_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 2])
sns.heatmap(confusion_matrix(y_test, y_predicted_LR_b), annot = True, fmt = '', cmap = "GnBu", ax=axes[1, 3])









    Out[60]:





<matplotlib.axes._subplots.AxesSubplot at 0x23004c2a1d0>

Ensemble voting classifier

All models, this is not a good option, it inherits all the problems of models that do not run well



In [61]:

    
eclf1 = VotingClassifier(estimators=[
    ('rbf',clf_RBF_b), ('LSCV',clf_LSCV_b),
    ('sigmoid',clf_SIGMOID_b), ('DTC',clf_DTC_b),
    ('RFC',clf_RFC_b),('EFC',clf_EFC_b),
    ('ABC',clf_ABC_b), ('svc',clf_LR_b)], 
                         voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
y_predict_eclf1 =  eclf1.predict(X_test)

print (eclf1.get_params(deep=True))
print (eclf1.score(X_train, y_train, sample_weight=None))









    



{'svc__dual': False, 'DTC': DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=7, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'), 'LSCV__penalty': 'l2', 'RFC__n_estimators': 10, 'EFC__n_estimators': 10, 'EFC__class_weight': 'balanced', 'RFC__max_depth': 7, 'sigmoid__decision_function_shape': None, 'sigmoid__probability': False, 'rbf__verbose': False, 'sigmoid__random_state': None, 'sigmoid__cache_size': 200, 'rbf__C': 1, 'rbf__tol': 0.001, 'DTC__splitter': 'best', 'ABC__base_estimator': None, 'LSCV__max_iter': 1000, 'svc__random_state': 123, 'sigmoid__tol': 0.001, 'rbf__decision_function_shape': None, 'rbf__coef0': 0.0, 'DTC__min_samples_leaf': 1, 'svc__intercept_scaling': 1, 'svc__penalty': 'l2', 'LSCV': LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0), 'RFC__random_state': 123, 'EFC__bootstrap': False, 'sigmoid__C': 1000.0, 'n_jobs': 1, 'rbf__cache_size': 200, 'EFC__min_weight_fraction_leaf': 0.0, 'weights': None, 'EFC__oob_score': False, 'RFC__min_samples_split': 2, 'rbf': SVC(C=1, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'voting': 'hard', 'sigmoid__kernel': 'sigmoid', 'svc__multi_class': 'ovr', 'svc__n_jobs': 1, 'estimators': [('rbf', SVC(C=1, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)), ('LSCV', LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)), ('sigmoid', SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=1,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)), ('DTC', DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=7, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best')), ('RFC', RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=7, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False)), ('EFC', ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=5, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False)), ('ABC', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.01, n_estimators=3, random_state=123)), ('svc', LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))], 'DTC__criterion': 'entropy', 'RFC__oob_score': False, 'DTC__max_leaf_nodes': None, 'DTC__presort': False, 'EFC__min_samples_split': 2, 'svc__solver': 'liblinear', 'sigmoid__coef0': 1, 'rbf__shrinking': True, 'svc__max_iter': 100, 'RFC__min_weight_fraction_leaf': 0.0, 'EFC__min_impurity_split': 1e-07, 'LSCV__intercept_scaling': 1, 'svc__warm_start': False, 'rbf__kernel': 'rbf', 'sigmoid__max_iter': -1, 'rbf__probability': False, 'RFC__max_leaf_nodes': None, 'LSCV__multi_class': 'ovr', 'RFC__warm_start': False, 'sigmoid__degree': 3, 'sigmoid__verbose': False, 'LSCV__fit_intercept': True, 'rbf__degree': 3, 'EFC__warm_start': False, 'LSCV__verbose': 0, 'ABC__algorithm': 'SAMME.R', 'RFC__n_jobs': 1, 'sigmoid': SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=1,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'DTC__min_samples_split': 2, 'svc__fit_intercept': True, 'rbf__class_weight': 'balanced', 'rbf__max_iter': -1, 'ABC__random_state': 123, 'RFC__max_features': 'auto', 'LSCV__dual': True, 'EFC__random_state': 123, 'LSCV__loss': 'hinge', 'RFC__bootstrap': True, 'EFC__n_jobs': 1, 'RFC__verbose': 0, 'DTC__max_depth': 7, 'EFC__criterion': 'gini', 'ABC': AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.01, n_estimators=3, random_state=123), 'RFC__criterion': 'gini', 'sigmoid__shrinking': True, 'EFC__verbose': 0, 'svc__class_weight': None, 'DTC__class_weight': 'balanced', 'RFC': RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=7, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False), 'EFC__min_samples_leaf': 1, 'sigmoid__gamma': 0.001, 'LSCV__tol': 0.0001, 'DTC__max_features': None, 'RFC__min_impurity_split': 1e-07, 'EFC__max_leaf_nodes': None, 'DTC__min_weight_fraction_leaf': 0.0, 'RFC__class_weight': 'balanced', 'DTC__random_state': 123, 'DTC__min_impurity_split': 1e-07, 'ABC__learning_rate': 0.01, 'RFC__min_samples_leaf': 1, 'EFC__max_features': 'auto', 'rbf__gamma': 1, 'rbf__random_state': None, 'ABC__n_estimators': 3, 'svc__tol': 0.0001, 'svc__C': 0.001, 'LSCV__random_state': None, 'svc__verbose': 0, 'EFC__max_depth': 5, 'svc': LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False), 'LSCV__C': 0.001, 'sigmoid__class_weight': 'balanced', 'LSCV__class_weight': None, 'EFC': ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=5, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False)}
0.766505636071



In [62]:

    
eclf2 = VotingClassifier(estimators=[
    ('rbf',clf_RBF), ('LSCV',clf_LSCV),
    ('sigmoid',clf_SIGMOID), ('DTC',clf_DTC),
    ('RFC',clf_RFC),('EFC',clf_EFC),
    ('ABC',clf_ABC), ('svc',clf_LOGREG)], 
                         voting='hard')
eclf2 = eclf2.fit(X_train, y_train)
y_predict_eclf2 =  eclf2.predict(X_test)

print (eclf2.get_params(deep=True))
print (eclf2.score(X_train, y_train, sample_weight=None))

#Basically does the same that chose the best models, this function uses the best models too









    



{'rbf__scoring': None, 'DTC__estimator__splitter': 'best', 'svc__estimator': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False), 'rbf__estimator__verbose': False, 'LSCV__estimator__verbose': 0, 'LSCV__estimator__tol': 0.0001, 'DTC__estimator__max_leaf_nodes': None, 'DTC__estimator__max_features': None, 'rbf__estimator': SVC(C=1.0, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'sigmoid__refit': True, 'svc__estimator__multi_class': 'ovr', 'LSCV__estimator__multi_class': 'ovr', 'sigmoid__scoring': None, 'RFC__pre_dispatch': '2*n_jobs', 'rbf__estimator__gamma': 'auto', 'RFC__estimator__warm_start': False, 'RFC__estimator__criterion': 'gini', 'EFC__estimator__min_samples_split': 2, 'LSCV__n_jobs': -1, 'RFC__cv': None, 'sigmoid__estimator': SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'EFC__estimator__verbose': 0, 'rbf__estimator__tol': 0.001, 'sigmoid__estimator__tol': 0.001, 'EFC__estimator__criterion': 'gini', 'LSCV': GridSearchCV(cv=None, error_score='raise',
       estimator=LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'ABC__estimator__base_estimator': None, 'ABC__return_train_score': True, 'EFC__estimator__max_leaf_nodes': None, 'DTC__verbose': 0, 'rbf__estimator__shrinking': True, 'ABC__scoring': None, 'RFC__estimator__verbose': 0, 'rbf__estimator__class_weight': 'balanced', 'svc__fit_params': {}, 'rbf': GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'rbf__pre_dispatch': '2*n_jobs', 'EFC__estimator__oob_score': False, 'svc__n_jobs': -1, 'LSCV__estimator__penalty': 'l2', 'EFC__scoring': None, 'DTC__estimator__random_state': 123, 'sigmoid__return_train_score': True, 'svc__return_train_score': True, 'EFC__estimator__max_features': 'auto', 'LSCV__refit': True, 'ABC__param_grid': {'n_estimators': [3, 5, 10], 'learning_rate': [0.01]}, 'sigmoid__estimator__degree': 3, 'rbf__estimator__coef0': 0.0, 'DTC__estimator__min_impurity_split': 1e-07, 'rbf__verbose': 0, 'ABC': GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=123),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'learning_rate': [0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'svc__estimator__fit_intercept': True, 'svc__estimator__solver': 'liblinear', 'RFC__estimator__min_samples_leaf': 1, 'sigmoid__fit_params': {}, 'EFC__refit': True, 'LSCV__pre_dispatch': '2*n_jobs', 'RFC__estimator__n_jobs': 1, 'DTC__return_train_score': True, 'sigmoid__verbose': 0, 'RFC__estimator__min_impurity_split': 1e-07, 'EFC__estimator__min_samples_leaf': 1, 'LSCV__estimator__class_weight': None, 'sigmoid__estimator__C': 1.0, 'svc__error_score': 'raise', 'ABC__n_jobs': -1, 'svc__estimator__penalty': 'l2', 'svc__refit': True, 'svc__pre_dispatch': '2*n_jobs', 'RFC__refit': True, 'LSCV__estimator__random_state': None, 'svc__estimator__warm_start': False, 'svc__estimator__tol': 0.0001, 'svc__iid': True, 'RFC__iid': True, 'sigmoid__estimator__max_iter': -1, 'LSCV__estimator__fit_intercept': True, 'RFC__estimator__max_features': 'auto', 'rbf__estimator__kernel': 'rbf', 'RFC__verbose': 0, 'rbf__n_jobs': 1, 'rbf__estimator__decision_function_shape': None, 'DTC__refit': True, 'DTC__error_score': 'raise', 'svc__estimator__intercept_scaling': 1, 'svc__cv': None, 'LSCV__iid': True, 'EFC__error_score': 'raise', 'rbf__estimator__probability': False, 'LSCV__scoring': None, 'DTC__estimator__presort': False, 'RFC__error_score': 'raise', 'DTC__pre_dispatch': '2*n_jobs', 'LSCV__estimator': LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0), 'RFC': GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 7, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'EFC__estimator': ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False), 'RFC__estimator': RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False), 'EFC__estimator__n_jobs': 1, 'LSCV__cv': None, 'sigmoid__estimator__kernel': 'sigmoid', 'rbf__estimator__degree': 3, 'sigmoid__estimator__cache_size': 200, 'DTC__scoring': None, 'ABC__fit_params': {}, 'svc': GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'svc__estimator__dual': False, 'EFC__return_train_score': True, 'ABC__cv': None, 'RFC__estimator__n_estimators': 10, 'rbf__cv': None, 'RFC__estimator__max_leaf_nodes': None, 'sigmoid__estimator__class_weight': 'balanced', 'svc__estimator__C': 1.0, 'DTC__estimator__min_samples_leaf': 1, 'rbf__estimator__max_iter': -1, 'svc__verbose': 0, 'LSCV__estimator__C': 1, 'ABC__estimator': AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=123), 'EFC__cv': None, 'EFC': GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'rbf__refit': True, 'EFC__iid': True, 'DTC': GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'sigmoid__pre_dispatch': '2*n_jobs', 'rbf__estimator__cache_size': 1000, 'EFC__verbose': 0, 'ABC__estimator__random_state': 123, 'RFC__estimator__oob_score': False, 'EFC__estimator__min_impurity_split': 1e-07, 'RFC__estimator__bootstrap': True, 'sigmoid__estimator__shrinking': True, 'EFC__estimator__random_state': 123, 'sigmoid__cv': None, 'LSCV__error_score': 'raise', 'svc__estimator__n_jobs': 1, 'RFC__estimator__min_samples_split': 2, 'EFC__param_grid': {'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]}, 'RFC__fit_params': {}, 'LSCV__param_grid': {'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0]}, 'rbf__estimator__C': 1.0, 'DTC__estimator': DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'), 'EFC__estimator__n_estimators': 10, 'rbf__return_train_score': True, 'ABC__pre_dispatch': '2*n_jobs', 'sigmoid__estimator__random_state': None, 'rbf__error_score': 'raise', 'ABC__refit': True, 'weights': None, 'DTC__estimator__class_weight': 'balanced', 'EFC__estimator__class_weight': 'balanced', 'svc__estimator__class_weight': None, 'voting': 'hard', 'ABC__estimator__learning_rate': 1.0, 'estimators': [('rbf', GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('LSCV', GridSearchCV(cv=None, error_score='raise',
       estimator=LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('sigmoid', GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [1000.0, 10000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'coef0': [-1, 0, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('DTC', GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('RFC', GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 7, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('EFC', GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('ABC', GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=123),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'learning_rate': [0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('svc', GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0))], 'LSCV__estimator__max_iter': 1000, 'EFC__fit_params': {}, 'svc__estimator__max_iter': 100, 'EFC__estimator__max_depth': None, 'LSCV__estimator__intercept_scaling': 1, 'ABC__iid': True, 'sigmoid__estimator__probability': False, 'DTC__n_jobs': -1, 'rbf__estimator__random_state': None, 'EFC__estimator__min_weight_fraction_leaf': 0.0, 'RFC__scoring': None, 'svc__param_grid': {'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0]}, 'svc__estimator__random_state': 123, 'sigmoid__estimator__decision_function_shape': None, 'LSCV__return_train_score': True, 'DTC__cv': None, 'DTC__estimator__min_weight_fraction_leaf': 0.0, 'RFC__return_train_score': True, 'DTC__estimator__criterion': 'entropy', 'LSCV__fit_params': {}, 'sigmoid__error_score': 'raise', 'LSCV__verbose': 0, 'rbf__iid': True, 'svc__estimator__verbose': 0, 'n_jobs': 1, 'DTC__estimator__min_samples_split': 2, 'ABC__estimator__algorithm': 'SAMME.R', 'DTC__estimator__max_depth': None, 'RFC__estimator__max_depth': None, 'sigmoid__estimator__gamma': 'auto', 'ABC__verbose': 0, 'LSCV__estimator__loss': 'hinge', 'sigmoid__n_jobs': -1, 'sigmoid__estimator__verbose': False, 'sigmoid': GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [1000.0, 10000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'coef0': [-1, 0, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'ABC__estimator__n_estimators': 50, 'DTC__fit_params': {}, 'EFC__n_jobs': -1, 'sigmoid__iid': True, 'rbf__fit_params': {}, 'RFC__estimator__class_weight': 'balanced', 'sigmoid__param_grid': {'C': [1000.0, 10000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'coef0': [-1, 0, 1]}, 'DTC__param_grid': {'max_depth': [2, 3, 4, 5, 6, 7]}, 'DTC__iid': True, 'LSCV__estimator__dual': True, 'ABC__error_score': 'raise', 'rbf__param_grid': {'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]}, 'RFC__estimator__min_weight_fraction_leaf': 0.0, 'svc__scoring': None, 'RFC__n_jobs': -1, 'EFC__estimator__warm_start': False, 'RFC__param_grid': {'n_estimators': [3, 5, 7, 10], 'max_depth': [2, 3, 4, 5, 6, 7]}, 'RFC__estimator__random_state': 123, 'sigmoid__estimator__coef0': 0.0, 'EFC__estimator__bootstrap': False, 'EFC__pre_dispatch': '2*n_jobs'}
0.766505636071



In [63]:

    
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')



In [64]:

    
cnf_matrix_RBF = confusion_matrix(y_test, y_predicted_RBF_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RBF, classes= '1',
                      title='RBF Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RBF, classes='1', normalize=True,
                      title='RBF Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[152   2]
 [ 52   2]]
Normalized confusion matrix
[[ 0.99  0.01]
 [ 0.96  0.04]]



In [65]:

    
cnf_matrix_LSCV = confusion_matrix(y_test, y_predicted_LSCV_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LSCV, classes='1',
                      title='LSCV Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LSCV, classes='1', normalize=True,
                      title='LSCV Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[154   0]
 [ 54   0]]
Normalized confusion matrix
[[ 1.  0.]
 [ 1.  0.]]



In [66]:

    
cnf_matrix_SIGMOID = confusion_matrix(y_test, y_predicted_SIGMOID_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_SIGMOID, classes='1',
                      title='SIGMOID Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_SIGMOID, classes='1', normalize=True,
                      title='SIGMOID Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[104  50]
 [ 40  14]]
Normalized confusion matrix
[[ 0.68  0.32]
 [ 0.74  0.26]]



In [67]:

    
cnf_matrix_DTC = confusion_matrix(y_test, y_predicted_DTC_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_DTC, classes='1',
                      title='DTC Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_DTC, classes='1', normalize=True,
                      title='DTC Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[99 55]
 [19 35]]
Normalized confusion matrix
[[ 0.64  0.36]
 [ 0.35  0.65]]



In [68]:

    
cnf_matrix_RFC = confusion_matrix(y_test, y_predicted_RFC_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RFC, classes='1',
                      title='RFC Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_RFC, classes='1', normalize=True,
                      title='RFC Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[115  39]
 [ 30  24]]
Normalized confusion matrix
[[ 0.75  0.25]
 [ 0.56  0.44]]



In [69]:

    
cnf_matrix_EFC = confusion_matrix(y_test, y_predicted_EFC_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_EFC, classes='1',
                      title='EFC Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_EFC, classes='1', normalize=True,
                      title='EFC Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[124  30]
 [ 26  28]]
Normalized confusion matrix
[[ 0.81  0.19]
 [ 0.48  0.52]]



In [70]:

    
cnf_matrix_ABC = confusion_matrix(y_test, y_predicted_ABC_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_ABC, classes='1',
                      title='ABC Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_ABC, classes='1', normalize=True,
                      title='ABC Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[154   0]
 [ 54   0]]
Normalized confusion matrix
[[ 1.  0.]
 [ 1.  0.]]



In [71]:

    
cnf_matrix_LR = confusion_matrix(y_test, y_predicted_LR_b)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LR, classes='1',
                      title='LR Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_LR, classes='1', normalize=True,
                      title='LR Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[154   0]
 [ 54   0]]
Normalized confusion matrix
[[ 1.  0.]
 [ 1.  0.]]

The following ensemble model only aggregate five models:



In [72]:

    
eclf3 = VotingClassifier(estimators=[
    ('rbf',clf_RBF), ('sigmoid',clf_SIGMOID), ('DTC',clf_DTC),
    ('RFC',clf_RFC),('EFC',clf_EFC)], 
                         voting='hard')
eclf3 = eclf3.fit(X_train, y_train)
y_predict_eclf3 =  eclf3.predict(X_test)

print (eclf3.get_params(deep=True))
print (eclf3.score(X_train, y_train, sample_weight=None))









    



{'rbf__refit': True, 'EFC__iid': True, 'DTC': GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'DTC__estimator__splitter': 'best', 'rbf__estimator__cache_size': 1000, 'rbf__estimator__kernel': 'rbf', 'DTC__estimator__max_leaf_nodes': None, 'DTC__estimator__max_features': None, 'rbf__estimator': SVC(C=1.0, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'sigmoid__refit': True, 'RFC__estimator__oob_score': False, 'EFC__estimator__min_impurity_split': 1e-07, 'sigmoid__scoring': None, 'sigmoid__estimator__shrinking': True, 'rbf__estimator__gamma': 'auto', 'RFC__error_score': 'raise', 'sigmoid__cv': None, 'RFC__estimator__warm_start': False, 'EFC__param_grid': {'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]}, 'RFC__estimator__verbose': 0, 'RFC__estimator__criterion': 'gini', 'EFC__estimator__min_samples_split': 2, 'EFC__n_jobs': -1, 'RFC__cv': None, 'RFC__estimator__min_samples_split': 2, 'EFC__scoring': None, 'EFC__estimator__random_state': 123, 'DTC__estimator__presort': False, 'RFC__fit_params': {}, 'EFC__estimator__verbose': 0, 'rbf__estimator__C': 1.0, 'DTC__estimator': DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'), 'sigmoid__estimator__tol': 0.001, 'EFC__estimator__n_estimators': 10, 'rbf__return_train_score': True, 'EFC__error_score': 'raise', 'rbf__estimator__tol': 0.001, 'EFC__estimator__max_leaf_nodes': None, 'rbf__estimator__verbose': False, 'DTC__verbose': 0, 'rbf__error_score': 'raise', 'EFC__return_train_score': True, 'rbf__estimator__shrinking': True, 'weights': None, 'rbf__estimator__class_weight': 'balanced', 'DTC__estimator__class_weight': 'balanced', 'EFC__estimator__class_weight': 'balanced', 'rbf': GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'EFC__pre_dispatch': '2*n_jobs', 'rbf__pre_dispatch': '2*n_jobs', 'voting': 'hard', 'estimators': [('rbf', GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=1000, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('sigmoid', GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [1000.0, 10000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'coef0': [-1, 0, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('DTC', GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('RFC', GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 7, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('EFC', GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0))], 'EFC__estimator__criterion': 'gini', 'rbf__fit_params': {}, 'EFC__fit_params': {}, 'sigmoid__return_train_score': True, 'rbf__estimator__random_state': None, 'EFC__estimator__max_features': 'auto', 'sigmoid__estimator__probability': False, 'DTC__n_jobs': -1, 'sigmoid__estimator__degree': 3, 'rbf__estimator__coef0': 0.0, 'DTC__estimator__min_impurity_split': 1e-07, 'rbf__verbose': 0, 'sigmoid__estimator__verbose': False, 'RFC__scoring': None, 'RFC__estimator__min_weight_fraction_leaf': 0.0, 'RFC__estimator__min_samples_leaf': 1, 'sigmoid__estimator__random_state': None, 'sigmoid__estimator__decision_function_shape': None, 'RFC__estimator__n_jobs': 1, 'DTC__return_train_score': True, 'sigmoid__verbose': 0, 'RFC__estimator__min_impurity_split': 1e-07, 'EFC__estimator__min_samples_leaf': 1, 'RFC__estimator__max_features': 'auto', 'DTC__estimator__min_weight_fraction_leaf': 0.0, 'RFC__return_train_score': True, 'DTC__estimator__criterion': 'entropy', 'EFC__estimator__min_weight_fraction_leaf': 0.0, 'EFC__refit': True, 'sigmoid__error_score': 'raise', 'EFC__estimator__max_depth': None, 'sigmoid': GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [1000.0, 10000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'coef0': [-1, 0, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'RFC__refit': True, 'rbf__iid': True, 'n_jobs': 1, 'RFC__estimator__bootstrap': True, 'DTC__estimator__min_samples_split': 2, 'RFC__pre_dispatch': '2*n_jobs', 'RFC__estimator__class_weight': 'balanced', 'sigmoid__param_grid': {'C': [1000.0, 10000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1], 'coef0': [-1, 0, 1]}, 'sigmoid__estimator__max_iter': -1, 'rbf__scoring': None, 'DTC__estimator__max_depth': None, 'RFC__estimator__max_depth': None, 'rbf__n_jobs': 1, 'EFC__verbose': 0, 'sigmoid__estimator': SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'rbf__estimator__decision_function_shape': None, 'DTC__refit': True, 'DTC__error_score': 'raise', 'RFC__verbose': 0, 'rbf__estimator__probability': False, 'sigmoid__n_jobs': -1, 'sigmoid__estimator__gamma': 'auto', 'sigmoid__fit_params': {}, 'DTC__pre_dispatch': '2*n_jobs', 'sigmoid__estimator__C': 1.0, 'DTC__fit_params': {}, 'sigmoid__estimator__kernel': 'sigmoid', 'EFC__estimator': ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False), 'sigmoid__iid': True, 'RFC__estimator': RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False), 'EFC__estimator__n_jobs': 1, 'RFC': GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 7, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'rbf__estimator__degree': 3, 'sigmoid__estimator__cache_size': 200, 'DTC__scoring': None, 'RFC__iid': True, 'sigmoid__pre_dispatch': '2*n_jobs', 'DTC__param_grid': {'max_depth': [2, 3, 4, 5, 6, 7]}, 'DTC__iid': True, 'RFC__estimator__n_estimators': 10, 'rbf__cv': None, 'RFC__estimator__max_leaf_nodes': None, 'sigmoid__estimator__class_weight': 'balanced', 'rbf__param_grid': {'C': [0.001, 0.01, 1, 1000.0, 5000.0, 10000.0, 50000.0, 100000.0], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1]}, 'DTC__estimator__random_state': 123, 'RFC__n_jobs': -1, 'EFC__estimator__warm_start': False, 'DTC__estimator__min_samples_leaf': 1, 'RFC__param_grid': {'n_estimators': [3, 5, 7, 10], 'max_depth': [2, 3, 4, 5, 6, 7]}, 'EFC__estimator__oob_score': False, 'rbf__estimator__max_iter': -1, 'RFC__estimator__random_state': 123, 'sigmoid__estimator__coef0': 0.0, 'EFC__estimator__bootstrap': False, 'DTC__cv': None, 'EFC__cv': None, 'EFC': GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)}
0.92270531401



In [73]:

    
print(y_predict_eclf3)









    



[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1
 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1 0
 1 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0]



In [74]:

    
scores_eclf3 = cross_val_score(eclf3 , X_train, y_train, cv=4, n_jobs=-1)
print(scores_eclf3 )
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_eclf3.mean(), scores_eclf3.std() * 2))

print (metrics.classification_report(y_test, y_predict_eclf3))









    



[ 0.74  0.74  0.67  0.71]
Accuracy: 0.7149 (+/- 0.0564)
             precision    recall  f1-score   support

          0       0.81      0.88      0.84       154
          1       0.55      0.43      0.48        54

avg / total       0.74      0.76      0.75       208



In [75]:

    
cnf_matrix_eclf3 = confusion_matrix(y_test, y_predict_eclf3)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf3, classes='1',
                      title='ECLF Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf3, classes='1', normalize=True,
                      title='ECLF Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[135  19]
 [ 31  23]]
Normalized confusion matrix
[[ 0.88  0.12]
 [ 0.57  0.43]]

This ensemble voting model aggregates decision tree and the extremely tree models:



In [76]:

    
eclf4 = VotingClassifier(estimators=[
    ('DTC',clf_DTC), ('EFC',clf_EFC)], 
                         voting='hard')
eclf4 = eclf4.fit(X_train, y_train)
y_predict_eclf4 =  eclf4.predict(X_test)

print (eclf4.get_params(deep=True))
print (eclf4.score(X_train, y_train, sample_weight=None))









    



{'DTC__return_train_score': True, 'EFC__iid': True, 'DTC': GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0), 'EFC__estimator__min_samples_leaf': 1, 'DTC__estimator__splitter': 'best', 'DTC__estimator__min_weight_fraction_leaf': 0.0, 'DTC__estimator__criterion': 'entropy', 'EFC__refit': True, 'DTC__estimator__max_leaf_nodes': None, 'DTC__estimator__max_features': None, 'EFC__estimator__min_weight_fraction_leaf': 0.0, 'EFC__estimator__random_state': 123, 'EFC__estimator__min_impurity_split': 1e-07, 'DTC__verbose': 0, 'DTC__error_score': 'raise', 'DTC__estimator__min_samples_split': 2, 'EFC__pre_dispatch': '2*n_jobs', 'EFC__estimator': ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False), 'EFC__estimator__min_samples_split': 2, 'DTC__estimator__max_depth': None, 'EFC__scoring': None, 'EFC__param_grid': {'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]}, 'EFC__estimator__verbose': 0, 'DTC__refit': True, 'DTC__estimator': DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'), 'EFC__estimator__n_estimators': 10, 'EFC__error_score': 'raise', 'DTC__estimator__presort': False, 'DTC__pre_dispatch': '2*n_jobs', 'EFC__estimator__max_leaf_nodes': None, 'DTC__fit_params': {}, 'EFC__n_jobs': -1, 'n_jobs': 1, 'EFC__estimator__n_jobs': 1, 'weights': None, 'DTC__estimator__class_weight': 'balanced', 'EFC__estimator__class_weight': 'balanced', 'DTC__scoring': None, 'voting': 'hard', 'DTC__param_grid': {'max_depth': [2, 3, 4, 5, 6, 7]}, 'DTC__iid': True, 'estimators': [('DTC', GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)), ('EFC', GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0))], 'EFC__estimator__criterion': 'gini', 'EFC__verbose': 0, 'EFC__fit_params': {}, 'EFC__estimator__max_depth': None, 'DTC__estimator__random_state': 123, 'EFC__estimator__max_features': 'auto', 'DTC__n_jobs': -1, 'EFC__estimator__warm_start': False, 'DTC__estimator__min_samples_leaf': 1, 'DTC__estimator__min_impurity_split': 1e-07, 'EFC__estimator__oob_score': False, 'EFC__return_train_score': True, 'EFC__estimator__bootstrap': False, 'DTC__cv': None, 'EFC__cv': None, 'EFC': GridSearchCV(cv=None, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [3, 5, 10], 'max_depth': [2, 3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)}
0.840579710145



In [77]:

    
scores_eclf4 = cross_val_score(eclf4 , X_train, y_train, cv=4, n_jobs=-1)
print(scores_eclf4 )
print("Accuracy: %0.4f (+/- %0.4f)" % (scores_eclf4.mean(), scores_eclf4.std() * 2))

print (metrics.classification_report(y_test, y_predict_eclf4))









    



[ 0.74  0.72  0.67  0.72]
Accuracy: 0.7133 (+/- 0.0529)
             precision    recall  f1-score   support

          0       0.82      0.86      0.84       154
          1       0.54      0.48      0.51        54

avg / total       0.75      0.76      0.75       208



In [78]:

    
cnf_matrix_eclf4 = confusion_matrix(y_test, y_predict_eclf4)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf4, classes='1',
                      title='ECLF Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_eclf4, classes='1', normalize=True,
                      title='ECLF Normalized confusion matrix')

plt.show()









    



Confusion matrix, without normalization
[[132  22]
 [ 28  26]]
Normalized confusion matrix
[[ 0.86  0.14]
 [ 0.52  0.48]]

The best model found here is the ensemble model with the decision tree and the extremely tree, even though the ensemble model with five aggregate methods shows a slightly better score, applying the principle of Occam's razor make the simplest better. At this time, the resulting model can accurately predict the deaths given a series of medical examinations, as proposed in the hypothesis. While accuracy is not the best (76%), I think it may be a good start for future investigations of interdisciplinary teams in ICU forecasting diseases.

From the forest model is possible to find how features weights in the results, such weights are called importance. As you can see only 129 features are important to the model, the rest has no weights. As you can expect, the antibiotic sensitivity are the most important feature (together weights 57% of importance) and AMIKACIN antibiotic the most important feature of the sample. Every feature from age group weight 0.97% in average, followed by category feature which everyone weights less than 1%.



In [88]:

    
# Code source:  http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
from sklearn.datasets import make_classification
forest = clf_EFC_b

forest.fit(X_train, y_train)
feature_names = X_train.columns
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X_train.shape[1]):
    if importances[indices[f]]>0:
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))


#plt.xticks(range(heart_train.shape[1]), )        
        
        
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names)
plt.xlim([-1, X_train.shape[1]])
plt.show()









    



Feature ranking:
1. feature 240 (0.053948)
2. feature 273 (0.034461)
3. feature 237 (0.030208)
4. feature 427 (0.029493)
5. feature 416 (0.029326)
6. feature 37 (0.027816)
7. feature 517 (0.025309)
8. feature 118 (0.024947)
9. feature 16 (0.023853)
10. feature 658 (0.020297)
11. feature 186 (0.019731)
12. feature 480 (0.019372)
13. feature 488 (0.016219)
14. feature 588 (0.015917)
15. feature 414 (0.015770)
16. feature 29 (0.015359)
17. feature 470 (0.015139)
18. feature 293 (0.014874)
19. feature 238 (0.014657)
20. feature 187 (0.014469)
21. feature 197 (0.014247)
22. feature 223 (0.014117)
23. feature 23 (0.013355)
24. feature 523 (0.013277)
25. feature 259 (0.012357)
26. feature 98 (0.011632)
27. feature 128 (0.011117)
28. feature 6 (0.010978)
29. feature 123 (0.010387)
30. feature 359 (0.010151)
31. feature 243 (0.009812)
32. feature 563 (0.009535)
33. feature 713 (0.009299)
34. feature 566 (0.008907)
35. feature 508 (0.008858)
36. feature 546 (0.008840)
37. feature 328 (0.008714)
38. feature 412 (0.008675)
39. feature 346 (0.008508)
40. feature 163 (0.008166)
41. feature 428 (0.008068)
42. feature 459 (0.007914)
43. feature 144 (0.007468)
44. feature 18 (0.007330)
45. feature 530 (0.007282)
46. feature 711 (0.007215)
47. feature 49 (0.007207)
48. feature 415 (0.007053)
49. feature 246 (0.006948)
50. feature 389 (0.006639)
51. feature 309 (0.006631)
52. feature 120 (0.006131)
53. feature 682 (0.005969)
54. feature 452 (0.005954)
55. feature 516 (0.005870)
56. feature 27 (0.005821)
57. feature 62 (0.005820)
58. feature 507 (0.005810)
59. feature 225 (0.005809)
60. feature 614 (0.005741)
61. feature 311 (0.005737)
62. feature 210 (0.005710)
63. feature 360 (0.005488)
64. feature 362 (0.005475)
65. feature 361 (0.005153)
66. feature 235 (0.004983)
67. feature 68 (0.004954)
68. feature 47 (0.004836)
69. feature 162 (0.004677)
70. feature 460 (0.004515)
71. feature 149 (0.004456)
72. feature 217 (0.004441)
73. feature 241 (0.004309)
74. feature 301 (0.004247)
75. feature 78 (0.004204)
76. feature 348 (0.004070)
77. feature 396 (0.003932)
78. feature 22 (0.003912)
79. feature 474 (0.003893)
80. feature 473 (0.003785)
81. feature 288 (0.003740)
82. feature 97 (0.003711)
83. feature 189 (0.003569)
84. feature 398 (0.003358)
85. feature 376 (0.003353)
86. feature 199 (0.003266)
87. feature 506 (0.003146)
88. feature 145 (0.003080)
89. feature 289 (0.003072)
90. feature 340 (0.002948)
91. feature 3 (0.002851)
92. feature 590 (0.002843)
93. feature 703 (0.002832)
94. feature 544 (0.002771)
95. feature 253 (0.002704)
96. feature 528 (0.002636)
97. feature 567 (0.002572)
98. feature 441 (0.002536)
99. feature 550 (0.002472)
100. feature 142 (0.002460)
101. feature 370 (0.002403)
102. feature 514 (0.002313)
103. feature 59 (0.002259)
104. feature 600 (0.002252)
105. feature 76 (0.002185)
106. feature 165 (0.002180)
107. feature 454 (0.002058)
108. feature 433 (0.002022)
109. feature 312 (0.002015)
110. feature 229 (0.001942)
111. feature 298 (0.001920)
112. feature 461 (0.001873)
113. feature 109 (0.001870)
114. feature 292 (0.001703)
115. feature 205 (0.001697)
116. feature 640 (0.001673)
117. feature 19 (0.001461)
118. feature 5 (0.001449)
119. feature 708 (0.001426)
120. feature 531 (0.001416)
121. feature 75 (0.001401)
122. feature 515 (0.001080)
123. feature 245 (0.000541)
124. feature 599 (0.000374)
125. feature 538 (0.000362)
126. feature 294 (0.000252)
127. feature 571 (0.000169)
128. feature 317 (0.000116)
129. feature 453 (0.000112)



In [108]:

    
from sklearn.datasets import make_classification
forest = clf_EFC_b

forest.fit(X_train, y_train)
feature_names = X_train.columns
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]


impor = []
for f in range(X_train.shape[1]):
    if importances[indices[f]]>0:
        impor.append({'Feature': feature_names[f] , 'Importance': importances[indices[f]]})

        

feature_importance = pd.DataFrame(impor).sort_values('Importance',ascending = False)
print(feature_importance.to_string())









    



                                               Feature  Importance
0                                     ab_name_AMIKACIN    0.053948
1                                   ab_name_AMPICILLIN    0.034461
2                         ab_name_AMPICILLIN/SULBACTAM    0.030208
3                                    ab_name_CEFAZOLIN    0.029493
4                                     ab_name_CEFEPIME    0.029326
5                                  ab_name_CEFTAZIDIME    0.027816
6                                  ab_name_CEFTRIAXONE    0.025309
7                                   ab_name_CEFUROXIME    0.024947
8                              ab_name_CHLORAMPHENICOL    0.023853
9                                ab_name_CIPROFLOXACIN    0.020297
10                                 ab_name_CLINDAMYCIN    0.019731
11                                  ab_name_DAPTOMYCIN    0.019372
12                                ab_name_ERYTHROMYCIN    0.016219
13                                  ab_name_GENTAMICIN    0.015917
14                                    ab_name_IMIPENEM    0.015770
15                                ab_name_LEVOFLOXACIN    0.015359
16                                   ab_name_LINEZOLID    0.015139
17                                   ab_name_MEROPENEM    0.014874
18                              ab_name_NITROFURANTOIN    0.014657
19                                   ab_name_OXACILLIN    0.014469
20                                  ab_name_PENICILLIN    0.014247
21                                ab_name_PENICILLIN G    0.014117
22                                ab_name_PIPERACILLIN    0.013355
23                           ab_name_PIPERACILLIN/TAZO    0.013277
24                                    ab_name_RIFAMPIN    0.012357
25                                ab_name_TETRACYCLINE    0.011632
26                                  ab_name_TOBRAMYCIN    0.011117
27                          ab_name_TRIMETHOPRIM/SULFA    0.010978
28                                  ab_name_VANCOMYCIN    0.010387
29                                     age_group_adult    0.010151
30                                   age_group_elderly    0.009812
31                                   age_group_neonate    0.009535
32                                age_group_oldest old    0.009299
33                                        category_ABG    0.008907
34                                      category_ABG'S    0.008858
35                                      category_ABG's    0.008840
36                    category_Access Lines - Invasive    0.008714
37                           category_Adm History/FHPA    0.008675
38                                     category_Alarms    0.008508
39                                category_Blood Gases    0.008166
40                                        category_CSF    0.008068
41                category_Cardiovascular (Pacer Data)    0.007914
42                                  category_Chemistry    0.007468
43                                      category_Coags    0.007330
44                                   category_Dialysis    0.007282
45                                 category_Drug Level    0.007215
46                                 category_Drug level    0.007207
47                                    category_Enzymes    0.007053
48                                      category_GI/GU    0.006948
49                                    category_General    0.006639
50                                 category_Hematology    0.006631
51                                  category_Heme/Coag    0.006131
52                               category_Hemodynamics    0.005969
53                                       category_IABP    0.005954
54                                       category_Labs    0.005870
55                                 category_Mixed VBGs    0.005821
56                         category_Mixed Venous Gases    0.005820
57                                      category_NICOM    0.005810
58                                     category_OB-GYN    0.005809
59                                   category_OT Notes    0.005741
60                                 category_Other ABGs    0.005737
61                              category_Pain/Sedation    0.005710
62                                      category_PiCCO    0.005488
63                                category_Quick Admit    0.005475
64                                category_Respiratory    0.005153
65                        category_Routine Vital Signs    0.004983
66                         category_Scores - APACHE II    0.004954
67                     category_Scores - APACHE IV (2)    0.004836
68                          category_Skin - Impairment    0.004677
69                                 category_Treatments    0.004515
70                                      category_Urine    0.004456
71                                      category_VBG'S    0.004441
72                                      category_VBG's    0.004309
73                                 category_Venous ABG    0.004247
74                                            gender_F    0.004204
75                                            gender_M    0.004070
76                                    interpretation_I    0.003932
77                                    interpretation_P    0.003912
78                                    interpretation_R    0.003893
79                                    interpretation_S    0.003785
80                                  label_ABG CHLOIRDE    0.003740
81                                 label_ABG POTASSIUM    0.003711
82                                 label_ABG Potassium    0.003569
83                                    label_ABG SODIUM    0.003358
84                                label_ABI Ankle BP L    0.003353
85                                label_ABI Ankle BP R    0.003266
86                             label_ABI Brachial BP L    0.003146
87                             label_ABI Brachial BP R    0.003080
88                                           label_ACT    0.003072
89                                 label_ACT (102-142)    0.002948
90                                           label_ALT    0.002851
91                                     label_APACHE II    0.002843
92                      label_APACHE II PDR - Adjusted    0.002832
93               label_APACHE II Predecited Death Rate    0.002771
94                                     label_APACHEIII    0.002704
95                                           label_APS    0.002636
96                              label_ART BP Diastolic    0.002572
97                               label_ART BP Systolic    0.002536
98                                   label_ART BP mean    0.002472
99               label_ART Blood Pressure Alarm - High    0.002460
100               label_ART Blood Pressure Alarm - Low    0.002403
101                             label_ART Lumen Volume    0.002313
102                                          label_AST    0.002259
103                                        label_ATC %    0.002252
104                           label_AaDO2ApacheIIValue    0.002185
105                              label_Access Pressure    0.002180
106                      label_Activated Clotting Time    0.002058
107      label_Activity HR - Aerobic Activity Response    0.002022
108               label_Activity HR - Aerobic Capacity    0.002015
109           label_Activity O2 Sat - Aerobic Capacity    0.001942
110  label_Activity O2 sat - Aerobic Activity Response    0.001920
111      label_Activity RR - Aerobic Activity Response    0.001873
112               label_Activity RR - Aerobic Capacity    0.001870
113                        label_Admission Weight (Kg)    0.001703
114                      label_Admission Weight (lbs.)    0.001697
115                             label_AgeApacheIIScore    0.001673
116                             label_AgeApacheIIValue    0.001461
117                            label_AgeScore_ApacheIV    0.001449
118                                      label_Albumin    0.001426
119                           label_Albumin  (3.9-4.8)    0.001416
120                               label_Albumin (>3.2)    0.001401
121                        label_AlbuminScore_ApacheIV    0.001080
122                             label_Albumin_ApacheIV    0.000541
123                               label_Alk. Phosphate    0.000374
124                         label_Alkaline Phosphatase    0.000362
125                           label_Alkaline Phosphate    0.000252
126                             label_Alsius Bath Temp    0.000169
127                                      label_Ammonia    0.000116
128                                      label_Amylase    0.000112



In [103]:

    
feature_importance.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 2 columns):
Feature       129 non-null object
Importance    129 non-null float64
dtypes: float64(1), object(1)
memory usage: 2.1+ KB



In [105]:

    
feature_importance['Importance'].sum()









    Out[105]:





1.0



In [122]:

    
#surveys_df[surveys_df.year == 2002]
more_important= feature_importance[feature_importance.Importance >= 0.009]
more_important









    Out[122]:






  
    
      
      Feature
      Importance
    
  
  
    
      0
      ab_name_AMIKACIN
      0.053948
    
    
      1
      ab_name_AMPICILLIN
      0.034461
    
    
      2
      ab_name_AMPICILLIN/SULBACTAM
      0.030208
    
    
      3
      ab_name_CEFAZOLIN
      0.029493
    
    
      4
      ab_name_CEFEPIME
      0.029326
    
    
      5
      ab_name_CEFTAZIDIME
      0.027816
    
    
      6
      ab_name_CEFTRIAXONE
      0.025309
    
    
      7
      ab_name_CEFUROXIME
      0.024947
    
    
      8
      ab_name_CHLORAMPHENICOL
      0.023853
    
    
      9
      ab_name_CIPROFLOXACIN
      0.020297
    
    
      10
      ab_name_CLINDAMYCIN
      0.019731
    
    
      11
      ab_name_DAPTOMYCIN
      0.019372
    
    
      12
      ab_name_ERYTHROMYCIN
      0.016219
    
    
      13
      ab_name_GENTAMICIN
      0.015917
    
    
      14
      ab_name_IMIPENEM
      0.015770
    
    
      15
      ab_name_LEVOFLOXACIN
      0.015359
    
    
      16
      ab_name_LINEZOLID
      0.015139
    
    
      17
      ab_name_MEROPENEM
      0.014874
    
    
      18
      ab_name_NITROFURANTOIN
      0.014657
    
    
      19
      ab_name_OXACILLIN
      0.014469
    
    
      20
      ab_name_PENICILLIN
      0.014247
    
    
      21
      ab_name_PENICILLIN G
      0.014117
    
    
      22
      ab_name_PIPERACILLIN
      0.013355
    
    
      23
      ab_name_PIPERACILLIN/TAZO
      0.013277
    
    
      24
      ab_name_RIFAMPIN
      0.012357
    
    
      25
      ab_name_TETRACYCLINE
      0.011632
    
    
      26
      ab_name_TOBRAMYCIN
      0.011117
    
    
      27
      ab_name_TRIMETHOPRIM/SULFA
      0.010978
    
    
      28
      ab_name_VANCOMYCIN
      0.010387
    
    
      29
      age_group_adult
      0.010151
    
    
      30
      age_group_elderly
      0.009812
    
    
      31
      age_group_neonate
      0.009535
    
    
      32
      age_group_oldest old
      0.009299



In [127]:

    
more_important.plot(kind='bar')









    Out[127]:





<matplotlib.axes._subplots.AxesSubplot at 0x22fe1057d30>



In [ ]:

	subject_id	gender	last_admit_age	age_group	category	label	valuenum_avg	org_name	ab_name	interpretation
0	157	M	80.54	elderly	ABG	Arterial Base Excess	-3.75	STAPH AUREUS COAG +	TETRACYCLINE	S
1	157	M	80.54	elderly	ABG	Arterial Base Excess	-3.75	STAPH AUREUS COAG +	RIFAMPIN	S
2	157	M	80.54	elderly	ABG	Arterial Base Excess	-3.75	STAPH AUREUS COAG +	ERYTHROMYCIN	R
3	157	M	80.54	elderly	ABG	Arterial Base Excess	-3.75	STAPH AUREUS COAG +	GENTAMICIN	S
4	157	M	80.54	elderly	ABG	Arterial Base Excess	-3.75	STAPH AUREUS COAG +	VANCOMYCIN	S

		ab_name_AMIKACIN	ab_name_AMPICILLIN	ab_name_AMPICILLIN/SULBACTAM	ab_name_CEFAZOLIN	ab_name_CEFEPIME	ab_name_CEFTAZIDIME	ab_name_CEFTRIAXONE	ab_name_CEFUROXIME	ab_name_CHLORAMPHENICOL	ab_name_CIPROFLOXACIN	...	org_name_STAPH AUREUS COAG +	org_name_STAPHYLOCOCCUS EPIDERMIDIS	org_name_STAPHYLOCOCCUS HOMINIS	org_name_STAPHYLOCOCCUS LUGDUNENSIS	org_name_STAPHYLOCOCCUS, COAGULASE NEGATIVE	org_name_STENOTROPHOMONAS (XANTHOMONAS) MALTOPHILIA	org_name_STREPTOCOCCUS ANGINOSUS (MILLERI) GROUP	org_name_STREPTOCOCCUS PNEUMONIAE	org_name_VIRIDANS STREPTOCOCCI	valuenum_avg
subject_id	hospital_expire_flag
41	0	0.0	0.018182	0.036364	0.036364	0.036364	0.036364	0.036364	0.036364	0.0	0.018182	...	0.236364	0.0	0.0	0	0.163636	0.072727	0.0	0.0	0	36.219025
157	0	0.0	0.076923	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0	0.000000	...	0.615385	0.0	0.0	0	0.000000	0.000000	0.0	0.0	0	44.302839
177	0	0.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0	0.000000	...	1.000000	0.0	0.0	0	0.000000	0.000000	0.0	0.0	0	39.911055
203	0	0.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0	0.000000	...	1.000000	0.0	0.0	0	0.000000	0.000000	0.0	0.0	0	53.690473
236	0	0.0	0.000000	0.090909	0.090909	0.090909	0.090909	0.090909	0.000000	0.0	0.090909	...	0.000000	0.0	0.0	0	0.000000	0.000000	0.0	0.0	0	69.293584

	Feature	Importance
0	ab_name_AMIKACIN	0.053948
1	ab_name_AMPICILLIN	0.034461
2	ab_name_AMPICILLIN/SULBACTAM	0.030208
3	ab_name_CEFAZOLIN	0.029493
4	ab_name_CEFEPIME	0.029326
5	ab_name_CEFTAZIDIME	0.027816
6	ab_name_CEFTRIAXONE	0.025309
7	ab_name_CEFUROXIME	0.024947
8	ab_name_CHLORAMPHENICOL	0.023853
9	ab_name_CIPROFLOXACIN	0.020297
10	ab_name_CLINDAMYCIN	0.019731
11	ab_name_DAPTOMYCIN	0.019372
12	ab_name_ERYTHROMYCIN	0.016219
13	ab_name_GENTAMICIN	0.015917
14	ab_name_IMIPENEM	0.015770
15	ab_name_LEVOFLOXACIN	0.015359
16	ab_name_LINEZOLID	0.015139
17	ab_name_MEROPENEM	0.014874
18	ab_name_NITROFURANTOIN	0.014657
19	ab_name_OXACILLIN	0.014469
20	ab_name_PENICILLIN	0.014247
21	ab_name_PENICILLIN G	0.014117
22	ab_name_PIPERACILLIN	0.013355
23	ab_name_PIPERACILLIN/TAZO	0.013277
24	ab_name_RIFAMPIN	0.012357
25	ab_name_TETRACYCLINE	0.011632
26	ab_name_TOBRAMYCIN	0.011117
27	ab_name_TRIMETHOPRIM/SULFA	0.010978
28	ab_name_VANCOMYCIN	0.010387
29	age_group_adult	0.010151
30	age_group_elderly	0.009812
31	age_group_neonate	0.009535
32	age_group_oldest old	0.009299