This notebook was originally put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. [Peter Prettenhofer](https://github.com/pprett) adapted it for PyCon Ukraine 2014. Source and license info is on [GitHub](https://github.com/pprett/sklearn_pycon2014/).

An Introduction to scikit-learn: Machine Learning in Python

Goals of this Tutorial

  • Introduce the basics of Machine Learning, and some skills useful in practice.
  • Introduce the syntax of scikit-learn, so that you can make use of the rich toolset available.

Schedule:

Outline:

10:00 - 10:30 Preliminaries: Setup & introduction

  • Making sure your computer is set up
  • What is Machine Learning?
  • Quick review of NumPy and Matplotlib

10:30 - 11:15 Basic Principles of Machine Learning and the Scikit-learn Interface

  • Machine learning data layout
  • Supervised Learning
    • Classification
    • Regression
    • Measuring performance
  • Unsupervised Learning
    • Clustering
    • Dimensionality Reduction
  • Evaluation of models
  • How to choose the right algorithm for your dataset

11:15 - 12:00 Supervised learning in-depth

  • Two important algorithms: Support Vector Machines and Random Forests
  • Application: recognizing handwritten digits

12:00 - 12:30 Unsupervised learning in-depth

  • Two important algorithms: PCA and K-Means
  • Application: Eigenfaces

12:30 - 13:00 Validation and Model Selection

  • Overfitting, underfitting, bias, and variance
  • Improving your fit: validation curves and learning curves
  • Application: facial recognition

Preliminaries

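This tutorial requires the following packages: NumPy, SciPy, Matplotlib, and scikit-learn. The notebooks themselves are run in the IPython notebook.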

The easiest way to get these is to use an all-in-one installer such as Anaconda from Continuum. Anaconda installers are available for multiple architectures.

Anaconda Setup

If you're using Anaconda, simply type

conda install scikit-learn

Otherwise it's best to install from source (requires a C compiler):

git clone https://github.com/scikit-learn/scikit-learn.git
cd scikit-learn
python setup.py install

Scikit-learn requires NumPy and SciPy, and examples require Matplotlib.

Note: some examples used in this tutorial require the scripts in the fig_code directory, which can be found within the notebooks subdirectory of the GitHub repository at https://github.com/pprett/sklearn_pycon2014/

Alternatives

  • Linux: If you're on Linux, you can use your distribution's package manager (for example, by typing apt-get install numpy or yum install numpy).

  • Mac: If you're on OS X, there are similar tools such as MacPorts or Homebrew, which provide pre-compiled versions of these packages.

  • Windows: Installation on Windows can be challenging; the best bet is probably to use an all-in-one installer such as Anaconda, described above.

Checking your installation

You can run the following code to check the versions of the packages on your system:

(In the IPython notebook, press Shift and Return together to execute the contents of a cell.)


In [1]:
from __future__ import print_function  # print() then works on both Python 2 and 3

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)


numpy: 1.8.2
scipy: 0.9.0
matplotlib: 1.3.1
scikit-learn: 0.15.2
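
The version check above stops at the first import that fails, so a single missing package hides the state of the others. The following is a minimal sketch, not part of the original notebook, that reports every package in one pass. The helper name installed_versions is hypothetical, and the 'unknown' placeholder for modules without a __version__ attribute is an assumption made for illustration.

```python
from __future__ import print_function

import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is missing (or its import failed); record it and keep going.
            versions[name] = None
        else:
            # Not every module defines __version__; fall back to a placeholder.
            versions[name] = getattr(module, '__version__', 'unknown')
    return versions


# Import names of the packages checked in the cell above.
required = ['numpy', 'scipy', 'matplotlib', 'sklearn']
for name, version in sorted(installed_versions(required).items()):
    print(name + ':', version if version is not None else 'NOT INSTALLED')
```

Note that scikit-learn is imported as sklearn, so the import name differs from the name you pass to conda install.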

Useful Resources

