I recommend installing Anaconda for ease: https://www.continuum.io/downloads. Certain packages, e.g. seaborn, do not come with the standard installation. A simple search for how to 'install seaborn with anaconda' will easily find the installation instructions, in this case http://seaborn.pydata.org/installing.html. In a terminal (Mac): conda install seaborn
Alexandru Agachi
m.a.agachi@gmail.com
Background: pursued degrees and graduate and postgraduate diplomas in international relations, energy science,
surgical robotics, neuroanatomy and imaging, and biomedical innovation
Working with a data driven firm in London that operates on financial markets, and teaching a big data module in the
biomedical innovation degree at Pierre et Marie Curie University in Paris.
Why statistics
Why focus on data
Better to be in an expanding world and not quite in exactly the right field, than to be in a contracting world where people's worst behavior comes out.
Eric Weinstein, Fellow University of Oxford, MD Thiel Capital.
Why healthcare? Ripe for data analysis: 30% of the world's data.
The biomedical sciences have been the pillar of the health care system for a long time now. The new system will have two equal pillars — the biomedical sciences and the data sciences.
Dr Scott Zeger, Johns Hopkins University.
In [3]:
#import the packages we will use
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#use simplest tool available
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import statsmodels.api as sm
import scipy.stats
#train_test_split now lives in sklearn.model_selection (sklearn.cross_validation is deprecated)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
#from sklearn import preprocessing
#from sklearn.linear_model import LassoLarsCV
#from sklearn.cluster import KMeans
%matplotlib inline
sns.set(style="whitegrid", color_codes=True)
#Whenever you are not sure of a function or parameter used, please verify documentation for that function at:
#http://pandas.pydata.org/pandas-docs/stable/
All credit for the data, and our many thanks, go to the principal investigators who collected this data and made it available:
Online repositories for data used and for this notebook:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
https://github.com/AlexandruAgachi/introductory_statistics_tutorial/blob/master/Introductory%20statistical%20data%20analysis%20with%20pandas%20PyData%20Berlin%202017.ipynb
Here, specify your own directory for wherever you stored the datasets on your computer.
For me: cd Downloads/Heart disease dataset
Note: when downloading the datasets, some people had to rename them manually with a '.txt' ending. The four data files that interest us should be named:
processed.cleveland.data.txt
processed.hungarian.data.txt
processed.switzerland.data.txt
processed.va.data.txt
Data management is an integral part of your research process
Choices you make here will influence your entire study
In [20]:
#let's load the first dataset in pandas with the read_csv function. This will create pandas dataframe objects per below.
cleveland = pd.read_csv('processed.cleveland.data.txt', header=None)
In [93]:
cleveland.head(10)
Out[93]:
In [25]:
hungary = pd.read_csv('processed.hungarian.data.txt', header=None)
In [26]:
hungary.head(10)
Out[26]:
In [27]:
switzerland = pd.read_csv('processed.switzerland.data.txt', header=None)
In [28]:
switzerland.head(10)
Out[28]:
In [30]:
va = pd.read_csv('processed.va.data.txt', header=None)
In [31]:
va.head(10)
Out[31]:
In [37]:
#ignore_index=True gives the combined dataframe a fresh, unique row index
df = pd.concat([cleveland, hungary, va, switzerland], ignore_index=True)
In [38]:
df
Out[38]:
In [57]:
#we rename all columns to make them more legible
df.columns = ['age', 'sex', 'chest_pain', 'rest_bp', 'cholesterol', 'fasting_bs', 'rest_ecg', 'max_heart_rate', 'exercise_angina', 'st_depression', 'slope', 'fluoroscopy',
'defect', 'diagnosis']
In [603]:
#let's look at our dataframe. the head(10) option tells pandas to return only the first ten rows of our dataframe.
df.head(10)
Out[603]:
A. Observational or experimental
Observational and experimental data
To know whether your data is observational or experimental, ask whether the explanatory variable was manipulated or not.
In a true experiment, 3 conditions:
Quasi experiment:
In an observational study the regression line only describes the data you see; it cannot be used to predict the result of an intervention.
*Randomization works best as your sample size approaches infinity; for small samples, imbalances between the groups can occur.
B. Longitudinal or cross sectional
Then data analysis? Wrong!
Background research. Always starts with background research. Domain knowledge.
We quickly find three studies using this dataset:
Detrano et al.: http://www.ajconline.org/article/0002-9149(89)90524-9/pdf
Sundaram and Kakade: http://www2.rmcil.edu/dataanalytics/v2015/papers/Clinical_Decision_Support_For_Heart_Disease.pdf
Soni et al.: http://www.enggjournals.com/ijcse/doc/IJCSE11-03-06-120.pdf
Value the time of domain experts. For the dataset we'll focus on today, look at the txt file AND at the accompanying explanation.
The first stage in a study is always exploratory data analysis. We typically receive the data without context, in a raw file that we cannot easily interpret. The five steps of exploratory data analysis:
This can be summarized in a notebook or even an initial data report for everyone involved in the data project
In [378]:
print(len(df))
In [379]:
print(len(df.columns))
The ratio of observations to variables largely exceeds the heuristic rule of 5 to 1.
See Victoria Stodden (2006) for a discussion of why this is important:
https://web.stanford.edu/~vcs/thesis.pdf
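As a quick check of this heuristic on our own data (a simple illustration, nothing more):
In [ ]:
#ratio of observations to variables; the heuristic suggests at least 5 observations per variable
print(len(df) / len(df.columns))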
convert all variables to numeric ones
In [60]:
#mark all variables as numeric data, and signify, for the relevant ones, that they are categorical rather than
#quantitative variables
#errors='coerce' tells pandas to return invalid values as NaN rather than as the input values themselves
#this step is crucial: otherwise subsequent analyses will not work properly, since pandas would keep the
#missing-value markers in these columns as strings
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['sex'] = pd.to_numeric(df['sex'], errors='coerce').astype('category')
df['chest_pain'] = pd.to_numeric(df['chest_pain'], errors='coerce').astype('category')
df['rest_bp'] = pd.to_numeric(df['rest_bp'], errors='coerce')
df['cholesterol'] = pd.to_numeric(df['cholesterol'], errors='coerce')
df['fasting_bs'] = pd.to_numeric(df['fasting_bs'], errors='coerce').astype('category')
df['rest_ecg'] = pd.to_numeric(df['rest_ecg'], errors='coerce').astype('category')
df['max_heart_rate'] = pd.to_numeric(df['max_heart_rate'], errors='coerce')
df['exercise_angina'] = pd.to_numeric(df['exercise_angina'], errors='coerce').astype('category')
df['st_depression'] = pd.to_numeric(df['st_depression'], errors='coerce')
df['slope'] = pd.to_numeric(df['slope'], errors='coerce').astype('category')
df['fluoroscopy'] = pd.to_numeric(df['fluoroscopy'], errors='coerce').astype('category')
df['defect'] = pd.to_numeric(df['defect'], errors='coerce').astype('category')
df['diagnosis'] = pd.to_numeric(df['diagnosis'], errors='coerce').astype('category')
Why bother? Nearly all statistical methods assume complete information. In the presence of missing data:
Three types of missing data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR).
If the data are MAR, the missing data mechanism is ignorable and there is no need to model it; if they are NMAR, the mechanism is not ignorable and one must develop a very good understanding of the missing data process in order to model it.
Standard options:
Advanced options:
For a great discussion of missing data: http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf
Statsmodels and Scikit learn imputation functions: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html http://www.statsmodels.org/dev/imputation.html
At this point in time, pandas, scikit-learn and statsmodels offer relatively poor methods of data imputation. You will therefore need to build your own imputation processes manually. But this will be worth your time!
"The only really good solution to the missing data problem is not to have any. So in the design and execution of research projects, it is essential to put great effort into minimizing the occurrence of missing data. Statistical adjustments can never make up for sloppy research." Paul Allison, 2001.
*Bias affects all measurements in the same way, while chance errors vary from measurement to measurement. Bias therefore cannot be detected by looking only at the measurements; we need an external or theoretical benchmark as well. Bias is often discussed in conjunction with variance.
Now we go through each variable in our dataset one by one. We check for missing values: whether there are any, how many, and what percentage of the total values they represent for each variable. For this we use the isnull() function to check for missing values and value_counts() to count them. We also look at what values our variables take, by applying value_counts() directly, without isnull().
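Before going column by column, a compact overview can also be useful. The sketch below (an optional shortcut, not part of the original walkthrough) shows the count and share of missing values for every column at once:
In [ ]:
#count of missing values per column
print(df.isnull().sum())
#share of missing values per column
print(df.isnull().mean())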
In [5]:
df['age'].isnull().value_counts()
In [65]:
df['sex'].isnull().value_counts()
Out[65]:
In [66]:
df['sex'].value_counts()
Out[66]:
In [381]:
#the normalize = True parameter returns for us the % of total rather than the absolute number.
df['sex'].value_counts(normalize=True)
Out[381]:
In [67]:
df['chest_pain'].isnull().value_counts()
Out[67]:
In [68]:
df['chest_pain'].value_counts()
Out[68]:
In [69]:
df['rest_bp'].isnull().value_counts()
Out[69]:
In [608]:
#we divide by len(df), to manually calculate the % of total observations that this represents each time.
df['rest_bp'].isnull().value_counts()/len(df)
Out[608]:
In [71]:
df['cholesterol'].isnull().value_counts()
Out[71]:
In [386]:
df['cholesterol'].isnull().value_counts()/len(df)
Out[386]:
In [72]:
df['fasting_bs'].isnull().value_counts()
Out[72]:
In [387]:
df['fasting_bs'].isnull().value_counts()/len(df)
Out[387]:
In [541]:
#standard value_counts() function drops missing values. To avoid this you can add dropna=False argument to function.
df['fasting_bs'].value_counts()
Out[541]:
In [388]:
df['rest_ecg'].isnull().value_counts()/len(df)
Out[388]:
In [75]:
df['rest_ecg'].value_counts()
Out[75]:
In [76]:
df['max_heart_rate'].isnull().value_counts()
Out[76]:
In [389]:
df['max_heart_rate'].isnull().value_counts()/len(df)
Out[389]:
In [78]:
df['exercise_angina'].isnull().value_counts()
Out[78]:
In [390]:
df['exercise_angina'].isnull().value_counts()/len(df)
Out[390]:
In [79]:
df['exercise_angina'].value_counts()
Out[79]:
In [80]:
df['st_depression'].isnull().value_counts()
Out[80]:
In [391]:
df['st_depression'].isnull().value_counts()/len(df)
Out[391]:
In [82]:
df['slope'].isnull().value_counts()
Out[82]:
In [392]:
df['slope'].isnull().value_counts()/len(df)
Out[392]:
In [83]:
df['slope'].value_counts()
Out[83]:
In [84]:
df['fluoroscopy'].isnull().value_counts()
Out[84]:
In [393]:
df['fluoroscopy'].isnull().value_counts()/len(df)
Out[393]:
In [85]:
df['fluoroscopy'].value_counts()
Out[85]:
In [86]:
df['defect'].isnull().value_counts()
Out[86]:
In [394]:
df['defect'].isnull().value_counts()/len(df)
Out[394]:
In [87]:
df['defect'].value_counts()
Out[87]:
In [88]:
df['diagnosis'].isnull().value_counts()
Out[88]:
In [89]:
df['diagnosis'].value_counts()
Out[89]:
In [395]:
len(df.columns)
Out[395]:
In [90]:
#the variables slope, defect and fluoroscopy have 33-47% of missing values
#context. Why?
#can we correct for this without introducing other biases into our dataset?
#Here we will decide to eliminate these variables.
#Data analysts are mere mortals too. Cannot fix everything.
#.copy() gives us an independent dataframe, so the imputation below does not trigger SettingWithCopyWarning
df_red = df[['age', 'sex', 'chest_pain', 'rest_bp', 'cholesterol', 'fasting_bs', 'rest_ecg', 'max_heart_rate',
             'exercise_angina', 'st_depression', 'diagnosis']].copy()
In [126]:
len(df_red)
Out[126]:
In [ ]:
#rest_bp, cholesterol, fasting_bs, rest_ecg, max_heart_rate, exercise_angina, st_depression
In [173]:
#rest_ecg is a categorical variable with only 2% of missing values.
#we make the choice to impute missing values with a straightforward method: the mode
#we are conscious this may introduce bias, but given the low number of missing values in what is
#a small-range categorical variable (no extreme outliers possible), we feel comfortable doing this
df_red['rest_ecg'].fillna(df_red['rest_ecg'].mode().iloc[0], inplace=True)
In [174]:
df_red['rest_ecg'].isnull().value_counts()
Out[174]:
In [278]:
#let's see if there is any overlap between missing values across observations, and what would happen if we
#were to limit our dataset to observations with no missing values
df_clean = df_red[df_red['rest_bp'].notnull() & df_red['cholesterol'].notnull() & df_red['fasting_bs'].notnull() &
                  df_red['max_heart_rate'].notnull() & df_red['exercise_angina'].notnull() &
                  df_red['st_depression'].notnull()]
In [396]:
len(df_clean)
Out[396]:
In [189]:
df_clean.isnull().any()
Out[189]:
This is not an approach we would use in a real-world study, but for the exploratory purposes of this tutorial we can retain the 85% of observations that were properly recorded on all variables. I will add here an example of conditional mean imputation for one of the variables. Statsmodels is very limited in its options, which means very manual work; from version 0.8.0 onwards it offers a MICE function.
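As promised, here is a minimal sketch of conditional mean imputation for one variable. It is purely illustrative and assumes we fill missing cholesterol values with the mean cholesterol of the corresponding sex group (any other grouping variable could be used); it works on a copy and does not feed into the analyses below.
In [ ]:
#conditional mean imputation: fill each missing cholesterol value with the mean cholesterol of that observation's sex group
df_cond = df_red.copy()
df_cond['cholesterol'] = df_cond.groupby('sex')['cholesterol'].transform(lambda s: s.fillna(s.mean()))
#remaining missing cholesterol values after the conditional imputation
print(df_cond['cholesterol'].isnull().sum())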
Now that we have cleaned the data we can move on to univariate analysis: one variable at a time.
Data reporting tells you what is happening, but data analysis tells you why it is happening.
Descriptive statistics
A parameter is calculated from the population, while a statistic is calculated from a sample.
In studying samples, we assume the Central Limit Theorem holds: if you draw enough samples from a population, and each sample is large enough, the distribution of the sample statistics will be normally distributed. The normal curve was discovered around 1720 by Abraham de Moivre; around 1870, Adolphe Quetelet thought of using it as the curve of an ideal histogram.
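A small simulation can make the Central Limit Theorem concrete. The sketch below is purely illustrative: it draws 1000 samples of size 100 from a skewed exponential population and plots the distribution of the sample means, which comes out roughly normal.
In [ ]:
#distribution of sample means drawn from a skewed (exponential) population
np.random.seed(0)
sample_means = [np.random.exponential(scale=2.0, size=100).mean() for _ in range(1000)]
sns.distplot(sample_means)
plt.suptitle('Distribution of 1000 sample means (approximately normal by the CLT)')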
Center, spread, shape
3 measures of center: mean, median and mode
1 measure of spread: standard deviation
2 attributes of shape: modality and skewness
Now we will aim to get an idea of the shape, center, and spread of these variables. We will analyze the shape visually by checking for modality and skewness, check measures of center such as the mean, median and mode, and check the spread through the standard deviation for quantitative variables.
Standard deviation: how far away observations are from their average. In a normal distribution, roughly 68% of observations are within 1 SD of the mean and 95% within 2 SDs.
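We can check this rule of thumb on our own data. The sketch below computes the share of age observations within 1 and 2 standard deviations of the mean; since age is only approximately normal, the shares will be close to, but not exactly, 68% and 95%.
In [ ]:
#share of age observations within 1 and 2 standard deviations of the mean
age_mean = df_clean['age'].mean()
age_std = df_clean['age'].std()
print(df_clean['age'].between(age_mean - age_std, age_mean + age_std).mean())
print(df_clean['age'].between(age_mean - 2*age_std, age_mean + 2*age_std).mean())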
Quantitative variables: shape, center and spread; histogram
Categorical variables: mode; bar chart or frequency distribution
For visualizing one variable: if it is categorical we use a bar chart, e.g. seaborn's countplot function; if it is quantitative, we can combine a kernel density estimate and a histogram with seaborn's distplot function.
For visualizing two variables:
C-Q: bivariate bar graph with seaborn's factorplot (bin/collapse the explanatory variable), categories on the x axis and the mean of the response variable on the y axis
Q-Q: scatterplot with seaborn's regplot
C-C: you can plot them one at a time; the problem with a bivariate graph is that the mean has no meaning in the context of a categorical variable
for further reading: http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/datapresentation/DataPresentation7.html
In [522]:
#Gender
#given that gender is a categorical variable, we use a countplot to visualize it
#we always plot the variable of interest on the x axis, and the count or frequency on the y axis
sns.countplot(x='sex', data = df_clean)
plt.suptitle('Frequency of observations by gender')
Out[522]:
the results definitely suggest a discussion about class imbalance is in order here.
In [194]:
#Diagnosis
#we check the counts per each value of the variable. sort=False tells pandas not to sort the results by values.
#normalize = True tells it to return the relative frequencies rather than the absolute counts
#if we had not cleaned the data, we could add parameter dropna=False so that value_counts does not drop null values
df_clean['diagnosis'].value_counts(sort=False, normalize=True)
Out[194]:
In [410]:
sns.countplot(x='diagnosis', data=df_clean)
plt.suptitle('Frequency distribution of diagnosis state')
Out[410]:
In [411]:
#Let's look at age now
#the describe request gives us the count, mean, std, min, max, as well as the quartiles for the
#respective value distribution
df_clean['age'].describe()
Out[411]:
In [412]:
#sns distplot function combines matplotlib hist() function with sns kdeplot() function.
sns.distplot(df_clean['age'])
plt.suptitle('Distribution of age')
Out[412]:
In [417]:
df_clean['max_heart_rate'].describe()
Out[417]:
In [420]:
sns.distplot(df_clean['max_heart_rate'])
plt.suptitle('Distribution of maximal heart rate')
Out[420]:
In [433]:
sns.swarmplot(x='diagnosis', data=df_clean)
Out[433]:
Their order does not matter here, hence Q-C and C-Q use the same visualization rule
C-Q: bivariate bar graph with seaborn's factorplot, categories on the x axis and the mean of the response variable on the y axis
Q-Q: scatterplot with seaborn's regplot
C-C: you can plot them one at a time; the problem with a bivariate graph is that the mean has no meaning in the context of a categorical variable
In [611]:
#kind = bar asks for a bar graph and ci=None suppresses error bars
sns.factorplot(x='sex', y='age', kind='bar', data=df_clean, ci=None)
plt.suptitle('Gender vs age')
Out[611]:
In [612]:
#categorical explanatory variable 'sex' and quantitative response variable 'rest_bp'
sns.factorplot(x='sex', y='rest_bp', data=df_clean, kind='bar')
Out[612]:
In [423]:
df_clean['chest_pain'].dtype
Out[423]:
In [613]:
#We use a regplot to plot two quantitative variables, age and cholesterol, while also having a regression line suggesting
#any association present
#we always plot the explanatory variable on the x axis, and the response variable on the y axis
sns.regplot(x='age', y='cholesterol', data=df_clean)
Out[613]:
In [614]:
#We see that several observations have a cholesterol of "0". This will need to be investigated subsequently and treated
#as missing values
In [444]:
#how can we gain a better idea of how two categorical variables interact?
#we divide by len(df_clean) so that the shares refer to the cleaned dataset we are actually analyzing
df_clean.groupby('sex')['diagnosis'].value_counts()/len(df_clean)
Out[444]:
In [615]:
#common statistical boxplot visualization
sns.boxplot(x='cholesterol', y = 'sex', data=df_clean)
plt.suptitle('Cholesterol levels by gender')
Out[615]:
In [617]:
#describing the dataset/the variables of the dataset, after we group the values in the dataset by diagnostic category
#the describe function gives us the key statistics for each quantitative variable in the dataset
df_clean.groupby('diagnosis').describe()
Out[617]:
In [279]:
#before starting to manipulate the dataset itself, we make a copy, and will work on the copy rather than the original
df_clean_copy = df_clean.copy()
We call them inferential statistics because we try to infer the population's parameters from our sample.
The process is always the same and involves hypothesis testing:
Typical H0: there is no relationship between the explanatory and response variable
Typical H1: there is a statistically significant relationship
Bivariate statistical tools: Three main tools: ANOVA; chi-square; correlation coefficient
Type 1 vs Type 2 errors
Type 1 error: the incorrect rejection of a true null hypothesis
Type 2 error: retaining a false null hypothesis
We will test the null hypothesis that age and diagnosis are not related. The type of variables we have (explanatory/response and categorical/quantitative for each) determines the type of statistical tools we will use:
Explanatory categorical and response quantitative: ANOVA
Explanatory categorical and response categorical: chi-square test
Explanatory quantitative and response categorical: classify/bin the explanatory variable and use a chi-square test
Explanatory quantitative and response quantitative: Pearson correlation
a result is statistically significant if it is unlikely to be the result of chance
before performing these analyses, one needs to use the .dropna() function to include only valid data
In [207]:
#Going the wrong way?
#In regression analysis you can swap your predictor and response variables. This is because they may be correlated,
#in which case X is correlated with Y and by definition Y is correlated with X. There is no directionality implied, which
#is also why you cannot talk of causation, but only of correlation.
test1 = smf.ols(formula = 'age ~ C(diagnosis)', data = df_clean_copy).fit()
print(test1.summary())
#*Whenever we have an explanatory categorical variable with more than two levels, we need to explicitly state this for
#statsmodels. Here we state it by wrapping the variable name in C(). By definition, statsmodels then
#converts the first level of the variable into the reference group, therefore showing only n-1 levels in the results.
#Each level shown is, in brief, a comparison to the reference group. Here the reference group is a diagnosis of 0.
What interests us in an ANOVA table:
R squared
The F statistic
The p value (here called 'Prob (F-statistic)')
The coefficients of each explanatory variable and the associated p values (here the 'coef' and 'P>|t|' columns)
The 95% confidence interval for our coefficients (here called '[95.0% Conf. Int.]')
The 'R squared' statistic is a measure of how much of the variability in the response variable, age here, is explained by our model. We see here that it is only 0.134, so diagnostic group only helps us explain 13.4% of the variability in age. This can indicate either that we omit important explanatory variables that we could add to the model, or that we are missing the structure of the association.
The F statistic: an F test is any statistical test in which the test statistic has an F distribution under the null hypothesis. The F statistic = variation among sample means / variation within sample groups. The ANOVA F test asks: are the differences among the sample means due to true differences among the population means, or merely due to sampling variability?
The p value in an ANOVA table is the probability of getting an F value as large or larger if H0 is true, i.e. the probability of finding this value if there is no difference between the sample means. I would like to thank M.N. for sending me the blog below, which contains a couple of great posts on common statistical concepts:
http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values
In an ANOVA table, we can decide if the relationship is statistically significant by checking the value of the F statistic against the relevant statistical table (F=28.42 above), or by looking at the p value. The latter is easiest. A long tradition in statistics is to consider a p value below a threshold (alpha) of 0.05 as indicative of a significant relationship. When conducting such tests, you need to decide what alpha value you feel comfortable with, and from then onwards you use a binary framework: either a statistic has an associated p value below the threshold you decided on at the beginning, or not. In this sense, a p value of 0.0001 is not indicative of a 5 times stronger relationship than a p value of 0.0005. In our case a p value of 5.56e-22 means the relationship is statistically significant (5.56e-22 < 0.05).
The coefficients for each variable. Here, for example, our model would be: age = 50.3025 + 2.6438*diagnosis_1 + 8.1279*diagnosis_2 + 8.8770*diagnosis_3 + 8.9248*diagnosis_4. We can also say that having a diagnosis of 1 is associated with an age 2.6438 years higher, assuming we hold all other explanatory variables in our model constant at 0. A negative but significant coefficient indicates that the explanatory variable and the response variable are negatively associated: as one increases, the other decreases. The higher the coefficient, the more impact it has on the value of the response variable under our model. However, these coefficients result from our sample, meaning that the population parameters may differ from them. The confidence intervals give us a range in which these parameters lie for the population, with 95% confidence. For example, we can be 95% confident that the population parameter for diagnosis_1 is between 1.134 and 4.153.
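As a worked example with these coefficients, the model's predicted mean age for the diagnosis = 2 group is 50.3025 + 8.1279 ≈ 58.4 years, compared with roughly 50.3 years for the reference group (diagnosis = 0).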
Whenever the explanatory variable has more than 2 levels, we also need to perform post hoc statistical tests to better understand the relationship between the explanatory and response variables: we know the groups tested differ overall, but not exactly where or how they differ. For explanatory variables with multiple levels, the F test and p value do not tell us why the group means are not equal, or how; there are many ways in which this can be the case.
How are the response and explanatory variables associated per level of the explanatory variable?
post hoc tests aim to protect against type 1 error when explanatory variable is multilevel
In [209]:
#now we examine the means and standard deviations
grouped1_mean = df_clean_copy.groupby('diagnosis').mean()['age']
print(grouped1_mean)
In [210]:
grouped1_std = df_clean_copy.groupby('diagnosis').std()['age']
print(grouped1_std)
In [211]:
#given that we have an explanatory categorical variable with multiple levels, we use the
#Tukey HSD (honestly significant difference) test
#other post hoc tests include:
#Holm's T
#Least Significant Difference
tukey1 = multi.MultiComparison(df_clean_copy['age'], df_clean_copy['diagnosis'])
res1 = tukey1.tukeyhsd()
print(res1.summary())
We will now test another hypothesis:
Hypothesis(0)(a): the presence of chest pain and the diagnosis (0 or 1) are independent
Alternative Hypothesis 1: presence of chest pain and diagnosis are not independent
Paradox: you can get better results with great feature engineering and a poor model than with poor feature engineering but a great model
"A feature is an attribute that is useful to your problem." Dr. Jason Brownlee
"The algorithms we used are very standard for Kagglers... We spent most of our efforts in feature engineering."
Xavier Conort, #1 Kaggler in 2013
Aims to convert data attributes into data features
Aims to optimize data modelling
Requires an understanding of the dataset and the research problem, and of the model you plan on using. It can be domain driven or data driven. E.g. for an SVM with a linear kernel you need to manually construct nonlinear interactions between features and feed them as input to your SVM model, whereas an SVM with a polynomial kernel will capture them naturally. Another example: SVMs are very sensitive to the scale (dimensions) of features, while decision trees and random forests are not. With tabular data you combine, aggregate, split and/or decompose features in order to create new ones. Given an output y and a feature x, you can try the following transforms first: e^x, log(x), x^2, x^3. An indicator of a useful transformation is that the correlation between y and the transformed feature x' is higher than the correlation between y and x; the best way to validate this is to check your model error with and without the transformed feature (see the sketch below).
http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/
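As a minimal, purely illustrative sketch of that correlation check, using two quantitative variables from our own dataset (age and max_heart_rate, chosen only for demonstration):
In [ ]:
#compare the correlation of max_heart_rate with age against simple transforms of age
x = df_clean_copy['age']
y = df_clean_copy['max_heart_rate']
print('raw x:', y.corr(x))
print('e^x:', y.corr(np.exp(x)))
print('log(x):', y.corr(np.log(x)))
print('x^2:', y.corr(x**2))
print('x^3:', y.corr(x**3))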
In [6]:
diagnosis_dic = {0:0, 1:1, 2:1, 3:1, 4:1}
df_clean_copy['diagnosis_binary'] = df_clean_copy['diagnosis'].map(diagnosis_dic)
In [474]:
df_clean_copy['diagnosis_binary'].value_counts()
Out[474]:
In [222]:
#contingency table of observed counts
#the crosstab function allows us to cross one variable with another
#when creating contingency tables, we put the response variable first (therefore vertical in table),
#and the explanatory variable second, therefore horizontal at the top of the table.
ct1 = pd.crosstab(df_clean_copy['diagnosis_binary'], df_clean_copy['chest_pain'])
print(ct1)
In [223]:
#column percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print(colpct)
In [224]:
#chi square test
#Expected counts: probabilities assuming the events are independent, p(1) * p(2), i.e. column total * row total / table total
#The chi square statistic summarizes the difference between our observations and what we would expect if H0 were true
#We rely on the p value, as the relevant distribution determines whether the chi square statistic itself is large or not
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
In [526]:
#Explanatory variable with multiple levels!
#we would have to do a pairwise comparison between every two groups of the explanatory
#variable vs the response variable
#This calls for a Bonferroni adjustment - we divide the significance threshold by the number of pairwise comparisons,
#and test each pair against the adjusted threshold.
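A minimal sketch of such Bonferroni-adjusted pairwise comparisons (illustrative only: it assumes we compare every pair of chest_pain levels against diagnosis_binary and test each pair at 0.05 divided by the number of comparisons):
In [ ]:
#pairwise chi-square tests between levels of chest_pain, with a Bonferroni-adjusted significance threshold
import itertools
levels = df_clean_copy['chest_pain'].dropna().unique()
pairs = list(itertools.combinations(levels, 2))
adjusted_alpha = 0.05 / len(pairs)
for a, b in pairs:
    subset = df_clean_copy[df_clean_copy['chest_pain'].isin([a, b])]
    ct = pd.crosstab(subset['diagnosis_binary'], subset['chest_pain'].astype(float))
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print(a, 'vs', b, ': p =', p, '| significant at adjusted alpha', adjusted_alpha, ':', p < adjusted_alpha)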
In [225]:
df_clean_copy.columns
Out[225]:
In [226]:
ct2 = pd.crosstab(df_clean_copy['diagnosis_binary'], df_clean_copy['sex'])
print(ct2)
In [227]:
#column percentages
colsum2 = ct2.sum(axis=0)
colpct2 = ct2/colsum2
print(colpct2)
In [228]:
#chi square test
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)
In [510]:
#Moderators
#a moderator is a third variable that affects the direction and/or strength between your explanatory and response variables
#the question is, is our response variable associated with our explanatory variable, for each level of our third variable?
In [240]:
#let's see if sex is a moderator in the statistically significant relationship between chest pain and diagnosis
#in this dataset sex is coded 1 = male and 0 = female
df_clean_copy_women = df_clean_copy[df_clean_copy['sex'] == 0]
len(df_clean_copy_women)
Out[240]:
In [241]:
df_clean_copy_men = df_clean_copy[df_clean_copy['sex'] == 1]
len(df_clean_copy_men)
Out[241]:
In [242]:
#contingency table of observed counts
#when creating contingency tables, we put the response variable first (therefore vertical in table),
#and the explanatory variable second, therefore horizontal at the top of the table.
ct3 = pd.crosstab(df_clean_copy_men['diagnosis_binary'], df_clean_copy_men['chest_pain'])
print(ct3)
In [243]:
#column percentages
colsum = ct3.sum(axis=0)
colpct = ct3/colsum
print(colpct)
In [244]:
#chi square test
print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)
In [246]:
#contingency table of observed counts
#when creating contingency tables, we put the response variable first (therefore vertical in table),
#and the explanatory variable second, therefore horizontal at the top of the table.
ct4 = pd.crosstab(df_clean_copy_women['diagnosis_binary'], df_clean_copy_women['chest_pain'])
print(ct4)
In [247]:
#column percentages
colsum = ct4.sum(axis=0)
colpct = ct4/colsum
print(colpct)
In [248]:
#chi square test
print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)
In [535]:
df_clean_copy_women.groupby('chest_pain')['diagnosis'].value_counts()
Out[535]:
The relationship between chest pain and the diagnosis holds for both levels of the sex variable, hence sex is not a moderator.
We would test for moderator variables in the same way in the case of a quantitative response variable.
Identifying a confounding variable does not allow us to establish causation, just to get closer to a causal connection. Due to the infinite number of possible lurking variables, observational studies cannot really establish causation. A lurking or confounding variable is a third variable that is associated with both the explanatory and response variables. E.g. x = number of firefighters, y = damage caused by a fire: a plot would suggest that more firefighters cause more fire damage, when in reality there is a third confounding variable that influences both, the seriousness of the fire. In a study we want to demonstrate that our statistical relationship is valid even after controlling for confounders.
Now we will test whether there is a relationship between two quantitative variables, age and cholesterol. For this we use the Pearson correlation test. r, which goes from -1 to 1, only tells us whether the two variables are linearly related; they may be related in nonlinear ways, so it is always important to look at r alongside a scatterplot of the two variables. r squared is a measure of how much variability in one variable can be explained by the other variable. To calculate the Pearson coefficient we need to remove all missing values. Please remember that when two variables are correlated it is possible that: X causes Y, or Y causes X, or Z causes both X and Y, or X and Y are correlated by chance - a spurious correlation.
In [231]:
df_clean_copy.columns
Out[231]:
In [450]:
scat1 = sns.regplot(x='age', y = 'cholesterol', fit_reg=True, data = df_clean_copy)
plt.xlabel('Age')
plt.ylabel('Cholesterol')
plt.suptitle('Relationship between age and cholesterol')
scat1
Out[450]:
In [630]:
#the r coefficient is a measure of association, of how closely points are clustered around a line
#correlations are always between -1 and 1
#r measures solely linear association
#association, not causation
#it is a number without units
#it can mislead in the presence of outliers or a nonlinear association -> always draw a scatter plot as well and check visually
#when you look at a scatter plot, you look at the direction, form and strength of the relationship
#if you identify a nonlinear association, with one or more curves, then you know you ought to add nonlinear explanatory
#variables to your regression model, to aim to capture this nonlinear association as well.
#ecological correlations based on averages can be misleading and overstate strength of associations for individual
#units
print('association between age and cholesterol')
print(scipy.stats.pearsonr(df_clean_copy['age'], df_clean_copy['cholesterol']))
Multivariate linear regression for a quantitative response variable
Logistic regression for a binary categorical response variable
Assumptions of these types of models:
Normality: the residuals from our linear regression model should be normally distributed. If they are not, our model may be misspecified.
Linearity: association between explanatory and response variable is linear
Homoscedasticity (or assumption of constant variance): variability in the response variable is the same at all levels of the explanatory variable. i.e. if residuals are spread around the regression line in a similar manner as you move along the x axis (values of the explanatory variable)
Independence: observations are not correlated with each other. Longitudinal data can violate this assumption, as can hierarchically nested/clustered data, e.g. looking at students grouped by class. This is the most serious assumption to violate, and it cannot be fixed by modifying the variables: the data structure itself is the problem.
We have to contend with:
Multicollinearity: explanatory variables are highly correlated with each other. this can mess up your parameter estimates or make them highly unstable. Signs:
Outliers: can affect your regression line
A multiple regression model allows us to find the relationship between one explanatory variable and the response variable while controlling for (holding constant at 0) all the other variables. For interpretability, each variable needs to include a meaningful value of 0, so as to make it easier to interpret the coefficients (what does it mean to hold cholesterol constant at 0 if its range has no value of 0?). For a categorical variable, we can just recode one of the values to be 0. For a quantitative variable, we have to center it. Centering = subtracting the mean of a variable from its values; we are therefore recoding it so that its mean = 0. If a quantitative explanatory variable already includes a meaningful value of 0, we may not need to center it. In linear regression we only center the explanatory variables, not the response one. In logistic regression we always need to code the binary response variable so that 0 means the outcome did not occur and 1 means it occurred; this is true whether the outcome is positive or negative.
We will create a multiple regression model investigating the relationship between our explanatory variables and our response variable, diagnosis. We will first center the explanatory variables: for categorical variables, one of the categories needs to be 0; for quantitative variables, we subtract the mean from each value.
Notes: do not center the response variable. If using logistic regression, do recode the binary response variable to make sure one class is coded as 0.
In [262]:
df_clean_copy.columns
Out[262]:
In [270]:
#categorical variables: sex, chest_pain, fasting_bs, rest_ecg, exercise_angina
#quantitative variables: age, rest_bp, cholesterol, max_heart_rate, st_depression
In [280]:
df_clean_copy['chest_pain'].value_counts()
Out[280]:
In [281]:
recode_chest_pain = {1:0, 2:1, 3:2, 4:3}
df_clean_copy['chest_pain_p'] = df_clean_copy['chest_pain'].map(recode_chest_pain)
In [282]:
df['fasting_bs'].value_counts()
Out[282]:
In [283]:
df['rest_ecg'].value_counts()
Out[283]:
In [284]:
df['exercise_angina'].value_counts()
Out[284]:
In [285]:
df_clean_copy['age_c'] = df_clean_copy['age'] - df_clean_copy['age'].mean()
In [286]:
df_clean_copy['rest_bp_c'] = df_clean_copy['rest_bp'] - df_clean_copy['rest_bp'].mean()
In [287]:
df_clean_copy['cholesterol_c'] = df_clean_copy['cholesterol'] - df_clean_copy['cholesterol'].mean()
In [288]:
df_clean_copy['max_heart_rate_c'] = df_clean_copy['max_heart_rate'] - df_clean_copy['max_heart_rate'].mean()
In [289]:
df_clean_copy['st_depression_c'] = df_clean_copy['st_depression'] - df_clean_copy['st_depression'].mean()
In [292]:
df_clean_copy.columns
Out[292]:
In [295]:
df_clean_copy_c = df_clean_copy[['age_c', 'sex', 'chest_pain_p', 'rest_bp_c', 'cholesterol_c',
'fasting_bs', 'rest_ecg', 'max_heart_rate_c', 'exercise_angina',
'st_depression_c', 'diagnosis_binary']]
In [296]:
df_clean_copy_c.columns
Out[296]:
In [459]:
model1 = smf.ols(formula = 'age_c ~ sex', data = df_clean_copy_c).fit()
print(model1.summary())
We now add explanatory variables to our model one at a time. In doing this, we keep an eye on what impact adding an explanatory variable has on our model. Sometimes, for example, we add a variable and another variable that was statistically significant (had a p value below 0.05) suddenly becomes insignificant. This means the new variable is a confounding variable in the relationship between that other variable and the response variable.
In [461]:
model2 = smf.ols(formula = 'age_c ~ sex + cholesterol_c', data=df_clean_copy_c).fit()
print(model2.summary())
In [635]:
#here we will skip a couple of steps and create the overall model with all variables.
In [471]:
model3 = smf.ols(formula = 'age_c ~ sex + C(chest_pain_p) + rest_bp_c + cholesterol_c + fasting_bs + C(rest_ecg) + \
                  max_heart_rate_c + exercise_angina + st_depression_c + diagnosis_binary', data = df_clean_copy_c).fit()
print(model3.summary())
We identified some explanatory variables that are associated with age, but our model overall explains barely 27% of the variation in the response variable. Let's run some diagnostics.
In [466]:
#Q-Q plot for normality
fig1 = sm.qqplot(model3.resid, line='r')
#the red line represents the residuals we would expect if the model residuals were normally distributed
#our residuals below deviate somewhat from the red line, especially at the lower and higher quantiles, meaning they do not
#follow a normal distribution. This suggests there is structure in the association that our model does not fully capture.
#We could add more explanatory variables to try to better explain any curvilinear association.
In [632]:
#simple plot of residuals
stdres = pd.DataFrame(model3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
#resid_pearson normalizes our model's residuals
#ls='none' means points will not be connected
#we expect most residuals to fall within 2 standard deviations of the mean. Values beyond 2 indicate outliers, and
#beyond 3 extreme outliers.
#if more than 1% of our observations have standardized residuals with an absolute value greater than 2.5, or more than 5%
#have one greater than or equal to 2, there is evidence that the fit of the model is poor. The top cause of this is the
#omission of important explanatory variables from our model.
#in linear regression the reference line for the standardized residuals is horizontal at zero
#normalizing or standardizing residuals amounts to making them have a mean of 0 and sd of 1, so that they can be compared
#to a standard normal distribution
#if residuals show a strong pattern (up, down, polynomial) then it is a good indication of nonlinearity
#in the underlying relationship
Out[632]:
In [469]:
#leverage plot
fig4 = sm.graphics.influence_plot(model3, size=8)
print(fig4)
#leverage, always between 0 and 1. At 0, a standardized residual has no influence on our model.
#leverage is a measure of how much influence a specific residual (and therefore observation) has on our model.
#we see that we have extreme outliers, but they are low leverage, meaning they do not have an undue influence on our
#estimation of the regression model.
#we also have an observation that is both high leverage and an outlier (observation 33 in our run). We would need to
#further investigate the corresponding observation.
In [470]:
#now let's focus on our actual response variable in the study
#since it is a binary variable, we will need to use a logistic regression model
In [298]:
lreg1 = smf.logit(formula = 'diagnosis_binary ~ C(chest_pain_p)', data = df_clean_copy_c).fit()
print(lreg1.summary())
In [454]:
#again, normally we would add variables one at a time, but here we will go faster.
lreg2 = smf.logit(formula = 'diagnosis_binary ~ age_c + sex + C(chest_pain_p) + cholesterol_c', data = df_clean_copy_c).fit()
print(lreg2.summary())
In [472]:
lreg3 = smf.logit(formula = 'diagnosis_binary ~ age_c + sex + C(chest_pain_p) + rest_bp_c + \
cholesterol_c + fasting_bs + C(rest_ecg) + max_heart_rate_c + \
exercise_angina + st_depression_c', data = df_clean_copy_c).fit()
print(lreg3.summary())
In [302]:
#however, for logistic regression it makes much more sense to calculate odds ratios
#this is because in binary logistic regression we only deal with probabilities of an outcome (response variable)
#being 0 or 1.
#odds ratios are calculated by exponentiating the coefficients reported in the regression table.
#Odds ratio (OR) for an explanatory variable:
#if OR=1, there is no meaningful association between the explanatory and response variables
#if OR<1, the response variable becomes less likely as the explanatory one increases
#if OR>1, the response variable becomes more likely as the explanatory one increases
print('Odds Ratios')
print(np.exp(lreg3.params))
#Interpretation of OR
#in this dataset sex is coded 1 = male, 0 = female, so here we would say that based on our sample men were roughly
#3.7 times more likely than women to have a diagnosis of 1.
In [303]:
# odd ratios with 95% confidence intervals
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))
#we have 95% confidence that the sex odds ratio will be between 2.24 and 6.12 for the population.
In [305]:
#statsmodels offers significantly fewer residual analysis options for logistic regression than for linear regression
#yet there is no reason we cannot get information by studying the residuals
#simple plot of residuals
stdres = pd.DataFrame(lreg3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
#resid_pearson normalizes our model's residuals
#ls='none' means points will not be connected
#we expect most residuals to fall within 2 standard deviations of the mean. Values beyond 2 are outliers, and beyond 3
#extreme outliers.
#if more than 1% of our observations have standardized residuals with an absolute value greater than 2.5, or more than 5%
#have one greater than or equal to 2, there is evidence that the fit of the model is poor. The largest cause of this is the
#omission of important explanatory variables from our model.
#in linear regression the reference line for the standardized residuals is horizontal at zero
#if residuals show a strong pattern (up, down, polynomial) then it is a good indication of nonlinearity
#in the underlying relationship
Out[305]:
In [673]:
#Statsmodels and scikit learn still offer few and poor 'off the shelf' options for imputing missing values in more
#statistically sound ways.
#As of statsmodels 0.8.0 at least, statsmodels offers the MICE imputation function
#a description is available here: http://www.statsmodels.org/dev/imputation.html
#and the details behind the implementation are here:
#http://www.statsmodels.org/dev/_modules/statsmodels/imputation/mice.html
#to be able to use MICE you will need to update the statsmodels version that ships with Anaconda to 0.8.0
#I have yet to find good tutorials/examples of the MICE function
#Let's create here a dataset that keeps the missing values in one of our variables (st_depression)
df_clean_mice = df_red[df_red['rest_bp'].notnull() & df_red['cholesterol'].notnull() & df_red['fasting_bs'].notnull() &
                       df_red['max_heart_rate'].notnull() & df_red['exercise_angina'].notnull()]
In [674]:
df_clean_mice.isnull().any()
Out[674]:
In [497]:
df_clean_mice.describe()
Out[497]:
In [694]:
#we keep the quantitative variables and build the binary diagnosis outcome for the analysis model
df_clean_mice = df_clean_mice[['age', 'rest_bp', 'cholesterol', 'max_heart_rate', 'st_depression', 'diagnosis']].copy()
df_clean_mice['diagnosis_binary'] = (df_clean_mice['diagnosis'].astype(float) > 0).astype(int)
df_clean_mice = df_clean_mice.drop('diagnosis', axis=1)
In [695]:
import statsmodels.imputation.mice as mice
#we wrap our dataframe in a MICEData object; MICEData will impute the missing st_depression values
imp = mice.MICEData(df_clean_mice)
#we specify our analysis model: a logistic regression of the binary diagnosis on the other variables
formula_mice = 'diagnosis_binary ~ age + rest_bp + cholesterol + max_heart_rate + st_depression'
#We now take the MICEData object and the formula specification, and combine the imputation and the analysis
#(sm.Logit is the analysis model class; MICE fits it on each imputed dataset and pools the results)
mice_model = mice.MICE(formula_mice, sm.Logit, imp)
#results = mice_model.fit(10, 10)
#print(results.summary())
#various plots and summary statistics are available and/or being developed with the MICE package.
For me machine learning is simply the branch of statistics that has traditionally focused on working with large, heterogeneous datasets, characterized by numerous variables that interact in nonlinear ways. It is a collection of tools, just like statistics in general.
An amazing article on traditional parametric statistics vs machine learning is: "Statistical Modeling: The Two Cultures," Leo Breiman, 2001.
Machine learning can be used for:
Accuracy and the test error rate: accuracy is the rate at which an algorithm correctly classifies or estimates, and the test error rate is its complement; the goal is to minimize the test error rate. (In linear regression, which we saw before, fit was assessed through the mean squared error; in logistic regression, accuracy = how well the model classifies observations.)
Supervised vs unsupervised learning In supervised learning we work with labeled data In unsupervised learning we aim to find patterns in unlabeled data
In machine learning we will regularly face the bias-variance trade-off. Variance = the change in parameter estimates across different data sets; bias = how far off the model's estimated values are from the true values. Ideally we want low variance and low bias, but they are negatively associated: as one decreases, the other increases. Generally, as the complexity of a model increases, variance increases and bias decreases; simple models will have lower variance, but also be more biased.
We will briefly apply three machine learning algorithms here, which we could use in our exploratory data analysis:
Decision trees Random forests Support vector machines
I intend to expand this section in the future. It is difficult today not to take advantage of some of the machine learning tools available, including when it comes to data exploration. We often forget that key applications of machine learning tools include helping us gain insights faster than with more manual methods, and supporting feature engineering and/or selection.
In the data exploration phase, you would use very raw, off the shelf, machine learning algorithms, simply for exploratory/descriptive purposes
Decision trees
Note: decision trees cannot handle missing data!
In [553]:
df_clean_copy.columns
Out[553]:
In [566]:
#although decision trees and random forests are particularly suitable for dealing with categorical variables, and in
#many statistical packages we can input categorical variables directly, because of how they are implemented in
#scikit-learn we cannot input categorical variables directly in this package. We first have to encode them in a
#process called one-hot encoding. This creates a separate column (variable) for each value of our explanatory
#categorical variable. For this we use pandas' get_dummies function
df_sex = pd.get_dummies(df_clean_copy['sex'], prefix = 'sex')
In [569]:
df_chest_pain = pd.get_dummies(df_clean_copy['chest_pain'], prefix = 'chest_pain')
df_fasting_bs = pd.get_dummies(df_clean_copy['fasting_bs'], prefix = 'fasting_bs')
df_rest_ecg = pd.get_dummies(df_clean_copy['rest_ecg'], prefix = 'rest_ecg')
df_exercise_angina = pd.get_dummies(df_clean_copy['exercise_angina'], prefix='exercise_angina')
In [570]:
df_merged = pd.concat([df_clean_copy, df_sex, df_chest_pain, df_fasting_bs, df_rest_ecg, df_exercise_angina], axis=1)
In [571]:
df_merged.columns
Out[571]:
In [574]:
df_dt = df_merged[['age', 'sex_0.0', 'sex_1.0', 'chest_pain_1.0', 'chest_pain_2.0', 'chest_pain_3.0', \
'chest_pain_4.0', 'rest_bp', 'cholesterol', 'fasting_bs_0.0', 'fasting_bs_1.0', \
'rest_ecg_1.0', 'rest_ecg_2.0', 'max_heart_rate', 'exercise_angina_0.0', 'exercise_angina_1.0',\
'st_depression', 'diagnosis_binary']]
In [575]:
#here we select as predictors all variables in our dataset except for the response one
#.loc is used for label-based selection (.ix has been deprecated in pandas)
predictors = df_dt.loc[:, df_dt.columns != 'diagnosis_binary']
In [577]:
predictors.head(5)
Out[577]:
In [578]:
#we select as the target our response variable
target = df_dt['diagnosis_binary']
In [579]:
#we create our training and testing datasets. Each will have its predictors and its target (response variable).
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size = 0.4)
In [580]:
print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)
In [581]:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
In [582]:
predictions = classifier.predict(pred_test)
In [583]:
sklearn.metrics.confusion_matrix(tar_test, predictions)
Out[583]:
In [664]:
#let's look at which of our variables it considers most important:
print(classifier.feature_importances_)
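The raw array above follows the order of our predictor columns. A small convenience step (not part of the original notebook) pairs each importance with its column name and sorts them:
In [ ]:
#pair each feature importance with its column name and sort in descending order
importances = pd.Series(classifier.feature_importances_, index=predictors.columns)
print(importances.sort_values(ascending=False))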
In [645]:
sklearn.metrics.accuracy_score(tar_test, predictions)
Out[645]:
Random forests
Random forests, an ensemble learning method, are a more sophisticated way of using trees. They were pioneered by Tin Kam Ho, Leo Breiman (who also created 'bagging') and Adele Cutler. For a full discussion, please see:
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
In [653]:
classifier2 = RandomForestClassifier(n_estimators = 25)
classifier2 = classifier2.fit(pred_train, tar_train)
In [654]:
predictions2 = classifier2.predict(pred_test)
In [655]:
sklearn.metrics.confusion_matrix(tar_test, predictions2)
Out[655]:
In [672]:
#let's look at which of our variables the random forest considers most important:
#(these follow the order in which we input them to the random forest i.e. our dataset)
print(classifier2.feature_importances_)
In [656]:
sklearn.metrics.accuracy_score(tar_test, predictions2)
Out[656]:
Random forests typically outperform decision trees, in particular when our decision tree model exhibits instability. The lower accuracy here is perhaps a result of our implementation of the random forest: we would need to tune its parameters in order to optimize it for the problem at hand.
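A minimal sketch of such tuning with scikit-learn's GridSearchCV (illustrative only; the small parameter grid below is an arbitrary starting point, not a recommendation):
In [ ]:
#small grid search over a few random forest hyperparameters, scored by cross-validated accuracy on the training set
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [25, 100, 250], 'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid = grid.fit(pred_train, tar_train)
print(grid.best_params_)
print(grid.best_score_)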
Support vector machines
SVMs are a set of supervised learning methods, used for classification and regression. They handle nonlinearity, through kernelization, much better than a standard logistic classifier.
A couple of very good explanations:
http://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html
http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
http://scikit-learn.org/stable/modules/svm.html
In [657]:
from sklearn import svm
classifier3 = svm.SVC()
classifier3 = classifier3.fit(pred_train, tar_train)
predictions3 = classifier3.predict(pred_test)
In [658]:
sklearn.metrics.confusion_matrix(tar_test, predictions3)
Out[658]:
In [659]:
sklearn.metrics.accuracy_score(tar_test, predictions3)
Out[659]:
We obtain a relatively low accuracy. Once more, as for random forests, the more complex a machine learning tool is, the more work we need to do to adapt it to our problem and hopefully reach a high accuracy for our model.
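One simple adaptation worth trying, given that SVMs are sensitive to the scale of the features, is to standardize the predictors before fitting. This is a sketch under that assumption, not a tuned model:
In [ ]:
#standardize the predictors (mean 0, sd 1) before fitting the SVM, since SVMs are sensitive to feature scales
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
pred_train_scaled = scaler.fit_transform(pred_train)
pred_test_scaled = scaler.transform(pred_test)
classifier4 = svm.SVC()
classifier4 = classifier4.fit(pred_train_scaled, tar_train)
predictions4 = classifier4.predict(pred_test_scaled)
print(sklearn.metrics.accuracy_score(tar_test, predictions4))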
MOOCS
This is perhaps the best introductory data analysis course I have found online; it covers many of the topics treated here in more depth, in both SAS and pandas: https://www.coursera.org/learn/data-visualization
A similarly good course but in R: https://www.coursera.org/specializations/jhu-data-science
Another course I really like is Bill Chambers' Udemy one: https://www.udemy.com/data-analysis-in-python-with-pandas/
For machine learning algorithms and their implementation: https://www.coursera.org/specializations/machine-learning
Statistics books - classics:
Machine learning readings
https://www.amazon.com/Machine-Learning-Python-Techniques-Predictive/dp/1118961749/ref=sr_1_1?ie=UTF8&qid=1499250696&sr=8-1&keywords=machine+learning+in+python+bowles
https://www.amazon.com/Machine-Learning-Hands-Developers-Professionals/dp/1118889061/ref=sr_1_1?ie=UTF8&qid=1499250725&sr=8-1&keywords=Machine+learning+hands+on+for+developers+and+technical+professionals+Bell
https://www.amazon.com/Fundamentals-Machine-Learning-Predictive-Analytics/dp/0262029448/ref=sr_1_1?ie=UTF8&qid=1499251042&sr=8-1&keywords=fundamentals+of+machine+learning
Following R.S.'s question on how to get more practice and go further after this tutorial: I genuinely believe that there is no teacher like experience. What I find most useful is to go to a large online repository of datasets, e.g.
https://archive.ics.uci.edu/ml/datasets.html
or
https://bigml.com/gallery/datasets
Choose datasets that 1. focus on a topic that you find interesting, and 2. are relatively clean i.e. it’s very frustrating to spend a lot of time just cleaning data - that is an exercise in itself, and something you can learn by itself at your convenience (unless you are already a pro at it of course).
And then take one dataset from the sources above each week, and analyze it: Import it, visualize it, manage your data, analyze it with regression, try to improve your models with feature engineering, and then apply any machine learning methods to it as well. And every week make sure you do something new/add something i.e. one different type of graph, one different type of residuals test for your regression, (feature engineering by definition will be different for each dataset), one new machine learning model.
Document each one of them, build a notebook for each, and upload them to your github/portfolio!
I genuinely believe this is the best, and the most ‘realistic’ way of learning and becoming an applied data analyst as well.
Once you are comfortable with such tasks, you can of course try to compete in Kaggle for example, whether to replicate past performances or in current competitions.
Thank you and please don't hesitate to contact me if I can assist you further.