An Introduction to scikit-learn: Machine Learning in Python

Goals of this Tutorial

  • Introduce the basics of Machine Learning, and some skills useful in practice.
  • Introduce the syntax of scikit-learn, so that you can make use of the rich toolset available.


Preliminaries: Setup & introduction (15 min)

  • Making sure your computer is set up

Basic Principles of Machine Learning and the Scikit-learn Interface (45 min)

  • What is Machine Learning?
  • Machine learning data layout
  • Supervised Learning
    • Classification
    • Regression
    • Measuring performance
  • Unsupervised Learning
    • Clustering
    • Dimensionality Reduction
    • Density Estimation
  • Evaluation of Learning Models
  • Choosing the right algorithm for your dataset

Supervised learning in depth (15 min)

  • Decision Trees and Random Forests

Unsupervised learning in depth (15 min)

  • Principal Component Analysis
  • K-means Clustering

Model Validation (15 min)

  • Validation and Cross-validation
  • GridSearchCV

Other topics

  • Pipeline


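This tutorial requires the following packages:

  • numpy
  • scipy
  • matplotlib
  • scikit-learn
  • IPython with the notebook (the ipython-notebook conda package)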

The easiest way to get these is with the conda environment manager. I suggest downloading and installing Miniconda.

The following command will install all required packages:

$ conda install numpy scipy matplotlib scikit-learn ipython-notebook

Alternatively, you can download and install the (very large) Anaconda software distribution.

Checking your installation

You can run the following code to check the versions of the packages on your system:

(In the IPython notebook, press Shift and Return together to execute the contents of a cell.)

In [ ]:
from __future__ import print_function

import IPython
print('IPython:', IPython.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)
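The cell above stops at the first package that fails to import. As a minimal sketch of an alternative, the snippet below reports every package in one pass and marks the missing ones. The helper name `check_packages` and the `REQUIRED` list are illustrative assumptions, not part of the tutorial; the import names are taken from the version check above.

```python
import importlib

# Import names assumed from the version check above; note that the
# scikit-learn package is imported as "sklearn".
REQUIRED = ["IPython", "numpy", "scipy", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each import name to its version string,
    or None if the package cannot be imported.

    Hypothetical helper for illustration only.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back gracefully.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Print one line per package instead of failing on the first missing one.
for name, version in check_packages(REQUIRED).items():
    print(name + ":", version if version is not None else "NOT INSTALLED")
```

Any package reported as NOT INSTALLED can then be added with the conda command given earlier.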

Useful Resources
