Machine Learning for Health Metricians

Week 2: Input, Output, and Accuracy

Any Questions?

No Class Next Monday

Week 2

  • Class 1: review homework, discuss the key points from the reading, begin to replicate two cool figures from ISL.
  • Class 2: demonstrate GPR and $k$-NN with scikit-learn
  • Before this week: read 2+ chapters (DM 2-3, ISL 2.2)
  • During this week’s classes:
    • Talk in the language of machine learning
    • Run the GPR and $k$-NN algorithms
    • Calculate training and test error
  • Outside of classes:
    • Read chapters of course texts (DM 4, ISL 3)
    • Work through “Predictive Accuracy” computational exercise

Lecture 2a Outline:

  • Homework Solutions
  • Input
  • Output
  • Accuracy
  • Exercise 2: In-sample and Out-of-sample Predictive Validity

Any (more) questions?

Anticipated questions:

  • Based on last class: what is machine learning, and how does it differ from statistics?
  • Based on the exercise/homework: what is a decision tree? How do I __ in Python?
  • Based on the reading: what is a feature? What is out-of-sample predictive validity? What is this obsession with rule-based learning?

Homework Solutions

  • Searching for best decision
  • Length-two decision list

The length-two decision list may be too complex to go over in class, but it may not be. It was too hard for the first homework when this was a one-credit class, but now that the class is three credits and the elevator pitch is not assigned in the first week, perhaps it works.

I left a lot of the details unspecified here, which could be annoying for students who like math problems to have right answers. There is a point to this, however: in methods research there is usually not a unique right answer, and figuring out the question is a big part of the work.

Input


In [40]:
import pandas as pd

# Load the College dataset; the first column holds the school names,
# so use it as the row index
df = pd.read_csv('college.csv', index_col=0)
df.head()


Out[40]:
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15

What is a concept?

The thing you are learning. DM divides machine learning tasks into four categories: classification, association, clustering, and "numeric prediction", a.k.a. regression.

There is one category that I think is missing, which has been of particular interest to my health systems buddies recently: "risk prediction". This could also be called "probabilistic prediction", and it is a mash-up of numeric prediction and classification. For example, we want to know whether a patient will survive at least 30 days after having a heart attack. A yes/no classification answer is not appropriate; I would like to know the probability that the answer is yes. It is particularly challenging to measure predictive quality for this concept.
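
As a sketch of what risk prediction looks like in code (the simulated data and scikit-learn's LogisticRegression here are my stand-ins for illustration, not part of the reading):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated stand-in data: 200 "patients", 3 features, 1 = survived 30+ days
rng = np.random.RandomState(12345)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, 0.7, size=200)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))        # hard yes/no classification answers
print(clf.predict_proba(X[:5]))  # [Pr(no), Pr(yes)]: the risk prediction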

What is an example?

What we are learning from. This is also called an "instance" or a "feature vector". It is a row in the input dataset.

What is an attribute?

Also called a "feature", this is a column in the input dataset.
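
In pandas terms, using the College frame loaded above (the particular row and column are arbitrary choices for illustration):

example = df.loc['Adelphi University']  # one example/instance: a row
feature = df['Apps']                    # one attribute/feature: a column
print(example['Apps'])                  # one feature value for one example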

Last word on input

  • "Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process."

In [41]:
import IPython.display

Video Interlude


In [42]:
IPython.display.YouTubeVideo("4TBcQ8h_kXU")


Out[42]: [embedded YouTube video]

Output

Also known as "knowledge representation", this is how the results of processing the data will be represented.

Tables

Linear Models

  • $\mathrm{PRP} = 37.06 + 2.47\cdot\mathrm{CACH}$
  • $2.0 - 0.5\cdot\text{PETAL-LENGTH} - \text{PETAL-WIDTH} = 0$
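
As a sketch, here is a linear model learned from the College data with numpy's least-squares fit (the choice of columns is mine, for illustration):

import numpy as np

# A degree-1 polynomial fit is ordinary least squares for a single feature
slope, intercept = np.polyfit(df['Outstate'], df['Grad.Rate'], 1)
print('Grad.Rate = %.2f + %.5f * Outstate' % (intercept, slope))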

Trees

Classification Rules

if a then x
if c and d then y
else z
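
The same rules, written as a Python function (a, c, and d are hypothetical boolean features); note that the order of the rules matters:

def classify(a, c, d):
    # Rules are tried in order; the first one that fires wins
    if a:
        return 'x'
    if c and d:
        return 'y'
    return 'z'

print(classify(a=False, c=True, d=True))  # -> 'y'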

Association Rules

if windy = false and play = no then
  outlk = sunny and humidity = high

Rules with Exceptions

The authors of Data Mining show their CS/AI background in their focus on rule-based knowledge representation, and especially in this rules-with-exceptions section, since such rules are almost never learned empirically from data (in my experience). This comes from the pre-statistical-learning days of AI, when people thought AI would emerge from putting together enough true facts and the appropriate logic-based inference rules.

Instance-based Representations

Despite the lukewarm attitude of the authors of DM, this is one of my favorite approaches, and we will explore it with the $k$-nearest neighbor algorithm today.
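
As a preview of today's demonstration, a minimal k-NN classifier on the College data with scikit-learn (my feature choice and k=5 are illustrative, not the in-class settings):

from sklearn.neighbors import KNeighborsClassifier

X = df[['Outstate', 'S.F.Ratio']].values
y = (df['Private'] == 'Yes').values

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))  # predictions come straight from the stored instances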

Clusters

Unsupervised learning is often much more of a challenge than supervised learning, and clusters are often what comes out. The question is: are they any good?

Accuracy


In [43]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_9.png', embed=True)


Out[43]: [ISL Figure 2.9]

In [44]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_15.png', embed=True)


Out[44]: [ISL Figure 2.15]

In [45]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_17.png', embed=True)


Out[45]: [ISL Figure 2.17]
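
To make these figures concrete, here is a sketch of training (in-sample) versus test (out-of-sample) accuracy, assuming scikit-learn's train_test_split and reusing the k-NN setup from above:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = df[['Outstate', 'S.F.Ratio']].values
y = (df['Private'] == 'Yes').values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('training accuracy:', knn.score(X_train, y_train))  # in-sample
print('test accuracy:', knn.score(X_test, y_test))        # out-of-sample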

Exercise 2

Exercise 2: In-sample and out-of-sample predictive validity, GPR, and $k$-NN
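
A minimal GPR sketch on simulated data, assuming scikit-learn's GaussianProcessRegressor (the exercise's actual data and kernel may differ):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Simulated 1-d regression problem: noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)

gpr = GaussianProcessRegressor().fit(X, y)
X_new = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)  # predictive mean and sd
print(mean[:3], std[:3])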

Homework:

  • Complete the replication of ISL Figure 2.17 (a skeleton sketch follows below)
  • Read the next chapters of the course texts (DM 4, ISL 3)
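
A possible skeleton for the Figure 2.17 replication: plot training and test error rates against 1/K for a range of K (the simulated data here is a placeholder, not ISL's simulation scenario):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder simulation; substitute the data from the exercise
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ks = [1, 2, 5, 10, 20, 50, 100]
train_err, test_err = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err.append(1 - knn.score(X_train, y_train))
    test_err.append(1 - knn.score(X_test, y_test))

plt.plot([1.0 / k for k in ks], train_err, label='training error')
plt.plot([1.0 / k for k in ks], test_err, label='test error')
plt.xlabel('1/K')
plt.ylabel('error rate')
plt.legend()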

In [46]:
import importlib
import ipynb_style

importlib.reload(ipynb_style)  # the bare reload() builtin is Python 2 only
ipynb_style.presentation()


Out[46]: [notebook styling output]