Machine Learning for Health Metricians

Week 3: Basic Algorithms


In [ ]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

Any Questions?

Week 3

  • Just one class: review homework, discuss the key points from the reading, and begin an exercise applying basic algorithms with scikit-learn
  • Before this week:
    • Read 2 chapters (DM 4, ISL 3)
    • Complete Predictive Accuracy computational exercise
  • During this week’s classes:
    • Hands-on application to DHS data.
  • Outside of classes:
    • Read 2 chapters of course text (DM 5, ISL 5)
    • Complete “Prediction” computational exercise

Lecture 3 Outline:

  • Homework Solutions
  • Cell phone ownership in DHS
  • Basic algorithms: Naïve Bayes, Linear Models, Decision Trees
  • Exercise 3: Prediction

Any (more) questions?

Anticipated questions: something about the difference between statistics and machine learning, since this week’s ISL chapter is very familiar material on linear regression, while the DM chapter is a wild ride through a range of methods, from naïve Bayes to logistic regression to recursive partitioning/decision trees.

Homework Solutions

  • Searching for best $k$ value in $k$-NN

I have a set of solutions that I can circulate, but I think I will see if anyone in class is so happy with their solution that they want to share it. It should be easy to get it in front of everyone with the Sage cloud.
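One way the $k$-search might be sketched (a hypothetical solution, not the circulated one; the data here are synthetic stand-ins for the homework dataset): score each candidate $k$ by cross-validation and keep the best.

```python
import numpy as np
import sklearn.neighbors
import sklearn.model_selection

# Synthetic stand-in data: 100 points, binary label driven by the first feature
rng = np.random.RandomState(12345)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# Cross-validated accuracy for each candidate k
scores = {}
for k in [1, 3, 5, 11, 21]:
    clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    scores[k] = sklearn.model_selection.cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
```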

An aside on Open Source

  • What is Open Source Software?
  • What is GitHub?

  • What is Git?

  • Who wants more?

  • Who wants in?

Now back to our regularly scheduled program

Lecture 3 Outline:

  • Homework Solutions
  • Cell phone ownership in DHS
  • Basic algorithms: Naïve Bayes, Linear Models, Decision Trees
  • Exercise 3: Prediction

DHS Data


In [17]:
df = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS.CSV', index_col=0)
df.head()


Out[17]:
sh110g hv247 hv246 hv227 sh118f hv243a hv243c hv243b hv225 hv209 hv208 hv243d hv207 hv206 hv212 hv221 hv210 hv211
0 0 1 0 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0
1 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0
2 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1
3 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
4 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0

In [18]:
cb = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS_codebook.CSV', index_col=0)
cb


Out[18]:
full name
hv206 has electricity
hv207 has radio
hv208 has television
hv209 has refrigerator
hv210 has bicycle
hv211 has motorcycle/scooter
hv212 has car/truck
hv221 has telephone (landline)
hv225 share toilet with other households
hv227 has mosquito bed net for sleeping
hv243a has mobile telephone
hv243b has watch
hv243c has animal-drawn cart
hv243d has boat with a motor
hv246 owns livestock, herds, or farm animals
hv247 has bank account
sh110g computer
sh118f boat without motor
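Per the codebook, the outcome of interest is hv243a (has mobile telephone); a first look at the ownership rate might go like this (a tiny stand-in frame is used here in place of the loaded df):

```python
import pandas as pd

# Stand-in for the DHS household frame; hv243a = has mobile telephone
df = pd.DataFrame({'hv243a': [1, 0, 1, 1, 0]})

# Fraction of households owning a cell phone
rate = df['hv243a'].mean()
```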

Lecture 3 Outline:

  • Homework Solutions
  • Cell phone ownership in DHS
  • Basic algorithms: Naïve Bayes, Linear Models, Decision Trees
  • Exercise 3: Prediction

Basic Algorithms

Naïve Bayes


In [ ]:
import sklearn.naive_bayes
clf = sklearn.naive_bayes.BernoulliNB()
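A minimal fit/predict round trip with BernoulliNB, on toy 0/1 arrays standing in for the DHS asset indicators (the columns and labels below are illustrative, not the real data):

```python
import numpy as np
import sklearn.naive_bayes

# Toy binary features: rows are households, columns are asset indicators;
# y stands in for hv243a (has mobile telephone)
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 1, 0])

clf = sklearn.naive_bayes.BernoulliNB()
clf.fit(X, y)
pred = clf.predict(X)
```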

Linear Models


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LinearRegression()
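With a 0/1 outcome, one simple way to use LinearRegression as a classifier is to threshold the fitted values at 0.5 (a common baseline; logistic regression is the more standard choice). A hypothetical sketch on toy data:

```python
import numpy as np
import sklearn.linear_model

# Toy data: one feature, binary outcome
X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0, 0, 1, 1])

clf = sklearn.linear_model.LinearRegression()
clf.fit(X, y)

# Threshold the continuous predictions to get 0/1 labels
yhat = (clf.predict(X) >= 0.5).astype(int)
```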

Decision Trees


In [ ]:
import sklearn.tree
clf = sklearn.tree.DecisionTreeClassifier()
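And the same round trip for a decision tree, on toy binary features (illustrative, not the DHS data); here the outcome is determined entirely by the first feature, so the tree recovers that split:

```python
import numpy as np
import sklearn.tree

# Toy data: outcome equals the first feature
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])

clf = sklearn.tree.DecisionTreeClassifier()
clf.fit(X, y)
```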

Thought Experiment

  • Which do you expect to work best for predicting cell phone ownership?

Now see if you agree with your neighbors, and try to convince each other, if not.

Lecture 3 Outline:

  • Homework Solutions
  • Cell phone ownership in DHS
  • Basic algorithms: Naïve Bayes, Linear Models, Decision Trees
  • Exercise 3: Prediction

Homework:

  • Complete the “Prediction” computational exercise, exploring many ways of predicting cell phone ownership
  • Read 2 chapters of course text (DM 5, ISL 5)

In [16]:
import ipynb_style
reload(ipynb_style)
ipynb_style.presentation()
