Anticipated questions:

- Based on last class: what is machine learning, and how does it differ from statistics?
- Based on the exercise/homework: what is a decision tree? how do I __ in Python?
- Based on the reading: what is a feature? what is out-of-sample predictive validity? what is this obsession with rule-based learning?
The length-two decision list may be too complex to go over in class, but it may not be. It was too hard for the first homework when this was a one-credit class, but now that it is three credits and the elevator pitch is not assigned in the first week, perhaps it works.
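For reference, a length-two decision list is just two ordered if/elif rules plus a default. A minimal sketch, assuming the college data loaded below; the attribute names and cutoffs here are invented for illustration, not part of the assignment:

In [ ]:
def length_two_decision_list(college):
    # a decision list is an ordered sequence of rules ending in a default;
    # "length two" means two rules before the default
    # (attribute names assume the ISL college data; thresholds are made up)
    if college['Outstate'] > 10000:
        return 'Yes'   # predict: private
    elif college['S.F.Ratio'] > 20:
        return 'No'    # predict: public
    else:
        return 'Yes'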
I left a lot of the details unspecified here, and that could be annoying for students who like math problems to have right answers. There is a point, however: in methods research there is usually not a unique right answer, and figuring out the question is a big part of the work.
In [40]:
import pandas as pd

# load the college dataset, using the first column as the row index
df = pd.read_csv('college.csv', index_col=0)
df.head()
Out[40]:
The thing you are learning. DM categorizes machine learning tasks into four categories: classification, association, clustering, and "numeric prediction" aka regression.
There is one that I think is missing, which has been of particular interest to my health systems buddies recently: "Risk Prediction". This could also be called "probabilistic prediction", and it is a mash-up of numeric prediction and classification. For example, we want to know if a patient will survive at least 30 days after having a heart attack. A yes/no classification answer is not appropriate; I would like to know the probability that the answer is yes. It is particularly challenging to measure predictive quality for this sort of concept.
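Here is a minimal sketch of what risk prediction looks like in code, assuming scikit-learn is available; the data is synthetic, and the Brier score is just one of several ways to measure the quality of predicted probabilities:

In [ ]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# synthetic data: one risk factor, binary 30-day survival outcome
rng = np.random.RandomState(12345)
X = rng.normal(size=(200, 1))
y = (rng.uniform(size=200) < 1. / (1. + np.exp(-X[:, 0]))).astype(int)

clf = LogisticRegression().fit(X, y)
p_survive = clf.predict_proba(X)[:, 1]  # P(survive), not a yes/no label
brier_score_loss(y, p_survive)          # mean squared error of the probabilities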
What we are learning from. This is also called an "instance", or a feature vector. It is a row in the input dataset.
Also called a "feature", this is a column in the input dataset.
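To make this concrete with the college data loaded above (a quick sketch; which columns exist depends on the CSV):

In [ ]:
df.iloc[0]           # one instance: the feature vector for a single college
df[df.columns[0]]    # one feature: a single column, measured for every college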
In [41]:
import IPython.display
In [42]:
IPython.display.YouTubeVideo("4TBcQ8h_kXU")
Out[42]:
Also known as "knowledge representation", this is how the results of processing the data will be represented.
The authors of Data Mining show their CS/AI background in their focus on rule-based knowledge representation, and especially in this rules-with-exceptions section, since these are almost never learned empirically from data (in my experience). This comes from the pre-statistical-learning days of AI, when people thought AI would emerge from putting together enough true facts and the appropriate logic-based inference rules.
Despite the lukewarm attitude of the authors of DM, this is one of my favorite approaches, and we will explore it with the $k$-nearest neighbor algorithm today.
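As a preview, here is a minimal sketch of $k$-nearest neighbors, assuming scikit-learn is available; the toy data and the choice of $k=3$ are mine, not from DM:

In [ ]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy two-dimensional data: two overlapping blobs, one per class
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(3, 1, size=(50, 2))])
y = np.repeat([0, 1], 50)

# instance-based learning: "training" just stores the instances;
# prediction is a majority vote among the k closest stored points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn.predict([[1.5, 1.5]])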
Unsupervised learning is often much more of a challenge than supervised learning, and clusters are often what comes out. The question is: are they any good?
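One quantitative way to ask "are they any good?" is a silhouette score. A minimal sketch, assuming scikit-learn; the synthetic data and the choice of three clusters are mine:

In [ ]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# unlabeled data: three blobs, though the algorithm is not told that
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in [0, 3, 6]])

km = KMeans(n_clusters=3).fit(X)
silhouette_score(X, km.labels_)  # close to 1 means tight, well-separated clusters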
In [43]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_9.png', embed=True)
Out[43]:
In [44]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_15.png', embed=True)
Out[44]:
In [45]:
IPython.display.Image(filename='/home/j/Project/Machine-Learning/ML4HM/ISL_Fig_2_17.png', embed=True)
Out[45]:
In [46]:
import ipynb_style
from importlib import reload  # in Python 3, reload lives in importlib
reload(ipynb_style)
ipynb_style.presentation()
Out[46]: