This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/).

An Introduction to scikit-learn: Machine Learning in Python

Goals of this Tutorial

  • Introduce the basics of Machine Learning, and some skills useful in practice.
  • Introduce the syntax of scikit-learn, so that you can make use of the rich toolset available.

Schedule:

Outline:

9:00 - 9:15 Preliminaries: Setup & introduction

  • Making sure your computer is set up

9:15 - 10:00 Basic Principles of Machine Learning and the Scikit-learn Interface

  • What is Machine Learning?
  • Machine learning data layout
  • Supervised Learning
    • Classification
    • Regression
    • Measuring performance
  • Unsupervised Learning
    • Clustering
    • Dimensionality Reduction
    • Density Estimation
  • Evaluation of Learning Models
  • Choosing the right algorithm for your dataset

10:00 - 10:45 Supervised learning in-depth

  • Support Vector Machines
  • Decision Trees and Random Forests

10:45 - 11:00 Break

11:00 - 11:45 Unsupervised learning in-depth

  • Dimensionality Reduction: Principal Component Analysis
  • Clustering: K Means
  • Density Estimation: Gaussian Mixture Models
  • Application: image color compression

11:45 - 12:20 Validation and Model Selection

  • Overfitting, underfitting, bias, and variance
  • Improving your fit: validation curves and learning curves
  • Application: facial recognition

Preliminaries

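This tutorial requires the following packages: numpy, scipy, matplotlib, scikit-learn, and the IPython notebook. The version check further down also imports seaborn.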

The easiest way to get these is to use the conda environment manager. I suggest downloading and installing Miniconda.

The following command will install all required packages:

$ conda install numpy scipy matplotlib scikit-learn ipython-notebook

Alternatively, you can download and install the (very large) Anaconda software distribution, found at https://store.continuum.io/.

Checking your installation

You can run the following code to check the versions of the packages on your system:

(In the IPython notebook, press Shift+Return to execute the contents of a cell.)


In [1]:
from __future__ import print_function

import IPython
print('IPython:', IPython.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

import seaborn
print('seaborn:', seaborn.__version__)


IPython: 2.4.1
numpy: 1.9.2
scipy: 0.15.1
matplotlib: 1.4.3
scikit-learn: 0.15.2
seaborn: 0.5.1
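
The cell above stops at the first package that fails to import. If you would rather see the whole picture at once, the check can be sketched as a loop that records a missing package instead of raising. This is only an illustrative variant, not part of the original tutorial: the helper name `report_versions` is hypothetical, and it assumes each package exposes a `__version__` attribute (packages that do not are reported as 'unknown').

```python
from __future__ import print_function

import importlib


def report_versions(package_names):
    """Map each import name to its version string.

    A package that cannot be imported maps to None; a package that
    imports but has no __version__ attribute maps to 'unknown'.
    """
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        versions[name] = getattr(module, '__version__', 'unknown')
    return versions


# Import names used in the cell above (scikit-learn is imported as sklearn).
tutorial_packages = ['IPython', 'numpy', 'scipy', 'matplotlib',
                     'sklearn', 'seaborn']

for name, version in sorted(report_versions(tutorial_packages).items()):
    print('{0}: {1}'.format(name, version if version else 'NOT INSTALLED'))
```

Any line reading NOT INSTALLED points at a package to add with conda before continuing.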

Useful Resources