Making Animations of UMAP Hyper-parameters

Sometimes one of the best ways to see the effect of hyperparameters is simply to visualise what happens as they change. We can do that in practice with UMAP by creating an animation that transitions between embeddings generated with different hyperparameter values. To do this we'll make use of matplotlib and its animation capabilities. Jake Vanderplas has a great tutorial if you want to know more about creating animations with matplotlib.

Note: This is a self-contained example of how to use UMAP and of the impact of individual hyper-parameters. To make sure everything works correctly, please use conda. For install and usage details see here

To create animations we need ffmpeg. It can be installed with conda.

If you already have ffmpeg installed on your machine and you know what you are doing, you do not need conda; it is only used here to install ffmpeg.
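If you want to check whether ffmpeg is already available on your PATH, a quick optional check from within the notebook is:

    !ffmpeg -version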

=> Remove the next two cells if you are not using conda.


In [1]:
!conda --version


conda 4.7.12

In [2]:
!conda install -c conda-forge ffmpeg -y


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Progams\Miniconda\envs\tf

  added / updated specs:
    - ffmpeg


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.9.11  |       hecc5488_0         181 KB  conda-forge
    certifi-2019.9.11          |           py37_0         155 KB
    ffmpeg-4.2                 |       h6538335_0        23.4 MB  conda-forge
    openssl-1.1.1c             |       hfa6e2cd_0         4.7 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        28.5 MB

The following NEW packages will be INSTALLED:

  ffmpeg             conda-forge/win-64::ffmpeg-4.2-h6538335_0

The following packages will be UPDATED:

  ca-certificates    pkgs/main::ca-certificates-2019.5.15-1 --> conda-forge::ca-certificates-2019.9.11-hecc5488_0
  certifi                                  2019.6.16-py37_1 --> 2019.9.11-py37_0

The following packages will be SUPERSEDED by a higher-priority channel:

  openssl              pkgs/main::openssl-1.1.1c-he774522_1 --> conda-forge::openssl-1.1.1c-hfa6e2cd_0



Downloading and Extracting Packages

openssl-1.1.1c       | 4.7 MB    | ########## | 100% 
ca-certificates-2019 | 181 KB    | ########## | 100% 
certifi-2019.9.11    | 155 KB    | ########## | 100% 
ffmpeg-4.2           | 23.4 MB   | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

In [3]:
!python --version


Python 3.7.3

To start, we'll need some basic libraries. First, numpy will be needed for basic array manipulation. Since we will be visualising the results, we will need matplotlib and seaborn. Finally, we will need umap for the dimension reduction itself.


In [4]:
!pip install numpy matplotlib seaborn umap-learn


Requirement already satisfied: numpy in c:\progams\miniconda\envs\tf\lib\site-packages (1.17.1)
Requirement already satisfied: matplotlib in c:\progams\miniconda\envs\tf\lib\site-packages (3.1.1)
Requirement already satisfied: seaborn in c:\progams\miniconda\envs\tf\lib\site-packages (0.9.0)
Requirement already satisfied: umap-learn in c:\progams\miniconda\envs\tf\lib\site-packages (0.3.10)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\progams\miniconda\envs\tf\lib\site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: cycler>=0.10 in c:\progams\miniconda\envs\tf\lib\site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\progams\miniconda\envs\tf\lib\site-packages (from matplotlib) (2.8.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\progams\miniconda\envs\tf\lib\site-packages (from matplotlib) (2.4.2)
Requirement already satisfied: pandas>=0.15.2 in c:\progams\miniconda\envs\tf\lib\site-packages (from seaborn) (0.25.1)
Requirement already satisfied: scipy>=0.14.0 in c:\progams\miniconda\envs\tf\lib\site-packages (from seaborn) (1.3.1)
Requirement already satisfied: scikit-learn>=0.16 in c:\progams\miniconda\envs\tf\lib\site-packages (from umap-learn) (0.21.3)
Requirement already satisfied: numba>=0.37 in c:\progams\miniconda\envs\tf\lib\site-packages (from umap-learn) (0.45.0)
Requirement already satisfied: setuptools in c:\progams\miniconda\envs\tf\lib\site-packages (from kiwisolver>=1.0.1->matplotlib) (41.0.1)
Requirement already satisfied: six in c:\progams\miniconda\envs\tf\lib\site-packages (from cycler>=0.10->matplotlib) (1.12.0)
Requirement already satisfied: pytz>=2017.2 in c:\progams\miniconda\envs\tf\lib\site-packages (from pandas>=0.15.2->seaborn) (2019.2)
Requirement already satisfied: joblib>=0.11 in c:\progams\miniconda\envs\tf\lib\site-packages (from scikit-learn>=0.16->umap-learn) (0.13.2)
Requirement already satisfied: llvmlite>=0.29.0dev0 in c:\progams\miniconda\envs\tf\lib\site-packages (from numba>=0.37->umap-learn) (0.29.0)

Now let's load everything we'll need.


In [5]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib import animation
from IPython.display import HTML
import seaborn as sns
import itertools
sns.set(style='white', rc={'figure.figsize':(14, 12), 'animation.html': 'html5'})

In [6]:
# Ignore UserWarnings
import warnings
warnings.simplefilter('ignore', UserWarning)

In [7]:
from sklearn.datasets import load_digits

In [8]:
from umap import UMAP

To try this out we'll need a reasonably small dataset (so embedding runs don't take too long, since we'll be doing a lot of them). For ease of reproducibility I'll use the digits dataset from sklearn. If you want to try other datasets just drop them in here -- COIL20 might be interesting, or you might have your own data.


In [9]:
digits = load_digits()
data = digits.data
data


Out[9]:
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

We need to move the points between the embeddings given by different parameter values. There are potentially fancier ways to do this (something using rotation and reflection to get an initial alignment might be interesting), but we'll use straightforward linear interpolation between the two embeddings. To do this we'll need a simple function that can produce intermediate embeddings for the in-between frames of the animation.


In [10]:
def tween(e1, e2, n_frames=20):
    # Hold the starting embedding for a few frames...
    for i in range(5):
        yield e1
    # ...linearly interpolate from e1 to e2 over n_frames frames...
    for i in range(n_frames):
        alpha = i / float(n_frames - 1)
        yield (1 - alpha) * e1 + alpha * e2
    # ...and hold the final embedding briefly as well.
    for i in range(5):
        yield e2
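As a quick sanity check (purely illustrative, using toy arrays), each transition yields 5 hold frames, n_frames interpolated frames, and 5 more hold frames -- 30 frames in total with the defaults, which is the cycle length the animation code below relies on:

    e1, e2 = np.zeros((4, 2)), np.ones((4, 2))
    frames = list(tween(e1, e2))
    len(frames)    # 30
    frames[15]     # part-way between e1 and e2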

Now that we can fill in the intermediate frames, we just need to generate all the embeddings. We'll create a function that takes a parameter name and a list of values for it, and then generates all the embeddings, including the in-between frames.


In [11]:
def generate_frame_data(data, arg_name='n_neighbors', arg_list=[]):
    result = []
    es = []
    for arg in arg_list:
        kwargs = {arg_name: arg}
        if len(es) > 0:
            # Initialise from the previous embedding so successive
            # embeddings stay roughly aligned with one another.
            es.append(UMAP(init=es[-1], negative_sample_rate=3, **kwargs).fit_transform(data))
        else:
            es.append(UMAP(negative_sample_rate=3, **kwargs).fit_transform(data))

    # Interpolate between each consecutive pair of embeddings.
    for e1, e2 in zip(es[:-1], es[1:]):
        result.extend(list(tween(e1, e2)))

    return result
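With the defaults above, each consecutive pair of embeddings contributes 30 frames, so a sweep over k parameter values yields 30 * (k - 1) frames. Purely as an illustration (this fits UMAP three times):

    frame_data = generate_frame_data(data, 'n_neighbors', [5, 15, 50])
    len(frame_data)    # 60 -- two transitions of 30 frames each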

Next we just need to create a function to actually generate the animation given a list of embeddings (one for each frame). This is really just a matter of working through the details of how matplotlib generates animations -- I would refer you again to Jake's tutorial if you are interested in the detailed mechanics of this.


In [12]:
def create_animation(frame_data, arg_name='n_neighbors', arg_list=[]):
    fig, ax = plt.subplots()
    # Fix the axis limits across all frames so the points don't jump around.
    all_data = np.vstack(frame_data)
    frame_bounds = (all_data[:, 0].min() * 1.1,
                    all_data[:, 0].max() * 1.1,
                    all_data[:, 1].min() * 1.1,
                    all_data[:, 1].max() * 1.1)
    ax.set_xlim(frame_bounds[0], frame_bounds[1])
    ax.set_ylim(frame_bounds[2], frame_bounds[3])
    # Colour the points by digit class.
    points = ax.scatter(frame_data[0][:, 0], frame_data[0][:, 1],
                        s=5, c=digits.target, cmap='Spectral', animated=True)
    title = ax.set_title('', fontsize=24)
    ax.set_xticks([])
    ax.set_yticks([])

    # A discrete colorbar labelling the ten digit classes.
    cbar = fig.colorbar(
        points,
        cax=make_axes_locatable(ax).append_axes("right", size="5%", pad=0.05),
        orientation="vertical",
        values=np.arange(10),
        boundaries=np.arange(11) - 0.5,
        ticks=np.arange(10),
        drawedges=True,
    )
    cbar.ax.yaxis.set_ticklabels(np.arange(10), fontsize=18)

    def init():
        points.set_offsets(frame_data[0])
        arg = arg_list[0]
        arg_str = f'{arg:.3f}' if isinstance(arg, float) else f'{arg}'
        title.set_text(f'UMAP with {arg_name}={arg_str}')
        return (points,)

    def animate(i):
        points.set_offsets(frame_data[i])
        # Each transition spans 30 frames (see tween); switch the title
        # to the next parameter value at the midpoint of each transition.
        if (i + 15) % 30 == 0:
            arg = arg_list[(i + 15) // 30]
            arg_str = f'{arg:.3f}' if isinstance(arg, float) else f'{arg}'
            title.set_text(f'UMAP with {arg_name}={arg_str}')
        return (points,)

    # interval=20 means 20ms per frame, i.e. 50 frames per second.
    anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(frame_data), interval=20, blit=True)
    plt.close()
    return anim

Finally, a little bit of glue to tie it all together.


In [13]:
def animate_param(data, arg_name='n_neighbors', arg_list=[]):
    frame_data = generate_frame_data(data, arg_name, arg_list)
    return create_animation(frame_data, arg_name, arg_list)
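If you would rather write an animation out to a file than embed it in the notebook, matplotlib can save it through ffmpeg. A minimal sketch (the filename is just an example):

    anim = animate_param(data, 'n_neighbors', [3, 4, 5, 7, 10, 15, 25, 50, 100, 200])
    anim.save('umap_n_neighbors.mp4', writer='ffmpeg', fps=50)

Here fps=50 matches the 20ms frame interval used in create_animation.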

Now we can create an animation. It will be embedded as an HTML5 video in this notebook (via the 'animation.html': 'html5' rc setting above).


In [14]:
animate_param(data, 'n_neighbors', [3, 4, 5, 7, 10, 15, 25, 50, 100, 200])


Out[14]:
[HTML5 video: the digits embedding animated as n_neighbors varies]

In [15]:
animate_param(data, 'min_dist', [0.0, 0.01, 0.1, 0.2, 0.4, 0.6, 0.9])


Out[15]:
[HTML5 video: the digits embedding animated as min_dist varies]

In [16]:
animate_param(data, 'local_connectivity', [0.1, 0.2, 0.5, 1, 2, 5, 10])


Out[16]:
[HTML5 video: the digits embedding animated as local_connectivity varies]

In [17]:
animate_param(data, 'set_op_mix_ratio', np.linspace(0.0, 1.0, 10))


Out[17]:
[HTML5 video: the digits embedding animated as set_op_mix_ratio varies]