Machine Learning for Health Metricians

Week 2: Input, Output, and Accuracy

Any Questions?

No Class Next Monday

Week 2

  • Class 1: review homework, discuss the key points from the reading, begin to replicate two cool figures from ISL.
  • Class 2: demonstrate GPR and $k$-NN with scikit-learn
  • Before this week: read 2+ chapters (DM 2-3, ISL 2.2)
  • During this week’s classes:
    • Talk in the language of machine learning
    • Run the GPR and $k$-NN algorithms
    • Calculate training and test error
  • Outside of classes:
    • Read chapters of course texts (DM 4, ISL 3)
    • Work through “Predictive Accuracy” computational exercise

Lecture 2a Outline:

  • Homework Solutions
  • Input
  • Output
  • Accuracy
  • Exercise 2: In-sample and Out-of-sample Predictive Validity

Any (more) questions?

Anticipated questions:

  • Based on last class: what is machine learning, and how does it differ from statistics?
  • Based on the exercise/homework: what is a decision tree? How do I __ in Python?
  • Based on the reading: what is a feature? What is out-of-sample predictive validity? What is this obsession with rule-based learning?

Homework Solutions

  • Searching for best decision
  • Length-two decision list

The length-two decision list may be too complex to go over in class, but it may not be. It was too hard for the first homework when this was a one-credit class, but now that the class is three credits and the elevator pitch is not assigned in the first week, perhaps it works.

I left a lot of the details unspecified here, which could be annoying for students who like math problems to have right answers. There is a point to this, however: in methods research there is usually not a unique right answer, and figuring out the question is a big part of the work.

Input


In [40]:
import pandas as pd

# Load the College dataset; the first column holds the school names,
# so use it as the row index
df = pd.read_csv('college.csv', index_col=0)
df.head()


Out[40]:
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15

What is a concept?

The thing you are learning. DM divides machine learning tasks into four categories: classification, association, clustering, and "numeric prediction", a.k.a. regression.

There is one category that I think is missing, which has been of particular interest to my health systems buddies recently: "risk prediction". This could also be called "probabilistic prediction", and it is a mash-up of numeric prediction and classification. For example, we want to know whether a patient will survive at least 30 days after having a heart attack. A yes/no classification answer is not appropriate; I would like to know the probability that the answer is yes. It is particularly challenging to measure predictive quality for this concept.
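
As a sketch of what risk prediction looks like in code (the simulated data and scikit-learn's LogisticRegression here are my stand-ins for illustration, not part of the reading):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated stand-in data: 200 "patients", 3 features, 1 = survived 30+ days
rng = np.random.RandomState(12345)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, 0.7, size=200)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))        # hard yes/no classification answers
print(clf.predict_proba(X[:5]))  # [Pr(no), Pr(yes)]: the risk prediction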

What is an example?

What we are learning from. This is also called an "instance" or a "feature vector". It is a row in the input dataset.

What is an attribute?

Also called a "feature", this is a column in the input dataset.
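
In pandas terms, using the College frame loaded above (the particular row and column are arbitrary choices for illustration):

example = df.loc['Adelphi University']  # one example/instance: a row
feature = df['Apps']                    # one attribute/feature: a column
print(example['Apps'])                  # one feature value for one example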

Last word on input

  • "Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process."

In [41]:
import IPython.display

Video Interlude


In [42]:
IPython.display.YouTubeVideo("4TBcQ8h_kXU")


Out[42]: [embedded YouTube video]

Output

Also known as "knowledge representation", this is how the results of processing the data will be represented.

Tables

Linear Models

  • $\mathrm{PRP} = 37.06 + 2.47\cdot\mathrm{CACH}$
  • $2.0 - 0.5\cdot\text{PETAL-LENGTH} - \text{PETAL-WIDTH} = 0$
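
As a sketch, here is a linear model learned from the College data with numpy's least-squares fit (the choice of columns is mine, for illustration):

import numpy as np

# A degree-1 polynomial fit is ordinary least squares for a single feature
slope, intercept = np.polyfit(df['Outstate'], df['Grad.Rate'], 1)
print('Grad.Rate = %.2f + %.5f * Outstate' % (intercept, slope))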

Trees

Classification Rules

if a then x
if c and d then y
else z
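
The same rules, written as a Python function (a, c, and d are hypothetical boolean features); note that the order of the rules matters:

def classify(a, c, d):
    # Rules are tried in order; the first one that fires wins
    if a:
        return 'x'
    if c and d:
        return 'y'
    return 'z'

print(classify(a=False, c=True, d=True))  # -> 'y'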

Association Rules

if windy = false and play = no then
  outlk = sunny and humidity = high

Rules with Exceptions

The authors of Data Mining show their CS/AI background in their focus on rule-based knowledge representation, and especially in this rules-with-exceptions section, since such rules are almost never learned empirically from data (in my experience). This comes from the pre-statistical-learning days of AI, when people thought AI would emerge from putting together enough true facts and the appropriate logic-based inference rules.

Instance-based Representations

Despite the lukewarm attitude of the authors of DM, this is one of my favorite approaches, and we will explore it with the $k$-nearest neighbor algorithm today.
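
As a preview of today's demonstration, a minimal k-NN classifier on the College data with scikit-learn (my feature choice and k=5 are illustrative, not the in-class settings):

from sklearn.neighbors import KNeighborsClassifier

X = df[['Outstate', 'S.F.Ratio']].values
y = (df['Private'] == 'Yes').values

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))  # predictions come straight from the stored instances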

Clusters

Unsupervised learning is often much more of a challenge than supervised learning, and clusters are often what comes out. The question is: are they any good?

Accuracy


In [43]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_9.png', embed=True)


Out[43]: [ISL Figure 2.9]

In [44]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_15.png', embed=True)


Out[44]: [ISL Figure 2.15]

In [45]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_17.png', embed=True)


Out[45]: [ISL Figure 2.17]
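
To make these figures concrete, here is a sketch of training (in-sample) versus test (out-of-sample) accuracy, assuming scikit-learn's train_test_split and reusing the k-NN setup from above:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = df[['Outstate', 'S.F.Ratio']].values
y = (df['Private'] == 'Yes').values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('training accuracy:', knn.score(X_train, y_train))  # in-sample
print('test accuracy:', knn.score(X_test, y_test))        # out-of-sample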

Exercise 2

Exercise 2: In-sample and out-of-sample predictive validity, GPR, and $k$-NN
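
A minimal GPR sketch on simulated data, assuming scikit-learn's GaussianProcessRegressor (the exercise's actual data and kernel may differ):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Simulated 1-d regression problem: noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)

gpr = GaussianProcessRegressor().fit(X, y)
X_new = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)  # predictive mean and sd
print(mean[:3], std[:3])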

Homework:

  • Complete the replication of ISL Figure 2.17 (a skeleton sketch follows below)
  • Read the next chapters of the course texts (DM 4, ISL 3)
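
A possible skeleton for the Figure 2.17 replication: plot training and test error rates against 1/K for a range of K (the simulated data here is a placeholder, not ISL's simulation scenario):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder simulation; substitute the data from the exercise
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ks = [1, 2, 5, 10, 20, 50, 100]
train_err, test_err = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err.append(1 - knn.score(X_train, y_train))
    test_err.append(1 - knn.score(X_test, y_test))

plt.plot([1.0 / k for k in ks], train_err, label='training error')
plt.plot([1.0 / k for k in ks], test_err, label='test error')
plt.xlabel('1/K')
plt.ylabel('error rate')
plt.legend()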

In [46]:
import importlib
import ipynb_style

importlib.reload(ipynb_style)  # the bare reload() builtin is Python 2 only
ipynb_style.presentation()


Out[46]: [notebook styling output]