Lecture 8.1

Machine Learning

We will review some of the basic methods in machine learning.

The text for this is Introduction to Machine Learning with Python by Andreas Müller and Sarah Guido. It focuses on using tools from Python, and not so much on theoretical aspects.

A good source for theoretical development is the book The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, available online at http://statweb.stanford.edu/~tibs/ElemStatLearn

The software tools come from scikit-learn at http://scikit-learn.org

A video about the tools is available here http://bit.ly/advanced_machine_learning_scikit-learn

Code examples are here: https://github.com/amueller/introduction_to_ml_with_python

What is machine learning?

It's about extracting knowledge from data.

Think of presenting a collection of data to a computer that runs some algorithm. The point is to extract useful features from the data (e.g. where are the buildings in this image) or make a decision based on the data (e.g. this email is spam). Machine learning involves an algorithm that learns "on its own" how to identify important features or make classification decisions.

Supervised versus unsupervised learning. Discuss.

Questions to ask when attempting to use machine learning:

  • What question am I trying to answer? Can the collected data possibly answer that question?
  • What is the best way to phrase the question as a machine learning problem?
  • Do I have enough data to represent the problem I want to solve?
  • What features can I extract from the data, and will they aid in answering the question?
  • How do I measure the success of the algorithm?
  • How will the machine learning solution interact with the bigger picture of how I make decisions, inform my research, or run my business?

Loading software tools

We will need Python, pandas, matplotlib, numpy, scipy, and scikit-learn (imported as sklearn).

I think all of these are already available in Syzygy, so you just have to import them from within your Jupyter notebook.
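A quick way to verify that everything is available is to import each package and print its version (the version numbers you see will depend on your Syzygy installation):

```python
# Verify the required packages are installed; versions will vary by system
import numpy
import scipy
import matplotlib
import pandas
import sklearn

for module in (numpy, scipy, matplotlib, pandas, sklearn):
    print(module.__name__, module.__version__)
```

If any of these imports fails, that package needs to be installed before continuing.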

Loading the examples from the book

You can clone the examples directly into your Jupyter Hub, using a git clone command.

I like doing this directly, as in the %%bash command below. You could also do this in the terminal, if you like. (NOTE: This command takes a while, so be patient.)


In [3]:
%%bash
git clone https://github.com/amueller/introduction_to_ml_with_python.git


Cloning into 'introduction_to_ml_with_python'...

Now go back to your Jupyter Hub file list, to access the code examples.

Some things to be careful about.

  1. The preamble seems buggy, so you might want to comment out this first line in the examples:

from preamble import *

  2. They forgot the display command, so you need to enter this code in the example notebooks:

from IPython.display import display


In [7]:
from scipy.misc import imread


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-7-f9d3d927b58f> in <module>()
----> 1 from scipy.misc import imread

ImportError: cannot import name 'imread'
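This error occurs because `imread` was removed from `scipy.misc` in recent versions of SciPy. A workaround is to use matplotlib's `imread` instead; the sketch below writes a tiny test image first so there is something to read back (the filename `tiny.png` is just an illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# scipy.misc.imread is gone in newer SciPy; matplotlib.pyplot.imread is a substitute.
# Write a tiny 2x2 black image, then read it back as a NumPy array of pixel values.
arr = np.zeros((2, 2, 3))
plt.imsave("tiny.png", arr)
img = plt.imread("tiny.png")
print(img.shape)  # a (height, width, channels) array
```

Any example notebook that calls `imread` can be patched the same way.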

Supervised versus unsupervised.

Classification, Regression.

Regression for classification.

Types of linear regression.

Start with data $$x_1, x_2, x_3, \ldots, x_n$$ and use this to predict some $y$, linearly:

$$y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$$

Find values for the parameters $a_1, \ldots, a_n$ that give the best fit. In matrix form,

$$\mathbf{y} = X\mathbf{a}$$

where $\mathbf{y}$ is a column vector, and $X$ is a matrix.

Minimize $$\| \mathbf{y} - X\mathbf{a} \|_2$$ over all choices of the vector $\mathbf{a}$.

Linear regression.

Least squares.
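The least-squares minimization above can be sketched with NumPy on synthetic data (the true coefficients and noise level here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
a_true = np.array([2.0, -1.0, 0.5])            # made-up "true" parameters
y = X @ a_true + 0.01 * rng.normal(size=100)   # linear model plus small noise

# lstsq finds the a that minimizes || y - X a ||_2
a_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(a_hat)  # close to a_true, since the noise is small
```

Scikit-learn's `LinearRegression` solves the same problem with a `fit`/`predict` interface.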

Ridge regression

Minimize $$\| \mathbf{y} - X\mathbf{a} \|_2^2 + \alpha \| \mathbf{a} \|_2^2$$

This is also known as Tikhonov regularization, or L2 regularization.
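In scikit-learn, ridge regression is the `Ridge` estimator, with `alpha` as the regularization strength (the data below is synthetic, for illustration only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

# alpha controls the weight of the L2 penalty on the coefficients
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)  # coefficients shrunk toward zero relative to plain least squares
```

Larger `alpha` means more shrinkage; `alpha=0` recovers ordinary least squares.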

Lasso regression: minimize $$\| \mathbf{y} - X\mathbf{a} \|_2^2 + \alpha \| \mathbf{a} \|_1$$

This is L1 regularization.
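The practical difference from ridge is that the L1 penalty tends to produce sparse solutions. A sketch with scikit-learn's `Lasso` on synthetic data where only two of five features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
# only the first two features actually influence y
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

# the L1 penalty tends to drive irrelevant coefficients to exactly zero
lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)
```

This sparsity is why lasso is often used as a form of automatic feature selection.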
