Visualizer for all SciPy distributions

The awesome scipy.stats subpackage holds a lot of continuous and discrete distributions that you might never have heard of. To quickly familiarize oneself with an unknown distribution plotting and experiencing the distribution helps a lot. This visualiser tries to make this as easy and comfortable as possible. This tool is based on Bokeh and ipywidgets.

TL;DR: Just run all cells and be stunned!


In [1]:
import warnings
from collections import OrderedDict, defaultdict
from enum import Enum

import numpy as np
from scipy import stats

from bokeh.io import output_notebook, show, push_notebook
from bokeh.plotting import figure
from bokeh.io import show

from ipywidgets import widgets, interact, interactive
from IPython.display import display 

warnings.simplefilter('ignore', DeprecationWarning)
output_notebook()


Loading BokehJS ...

We start with a little introspection to get lists of all continuous and discrete distributions in scipy.stats. In order to do so, we use an Enum to define the two types of distribution that exists in the world of mathematics, i.e. continuous and discrete distributions. Based on a given type we determine all classes that have as base class either stats.rv_contiuous or stats.rv_discrete and create an ordered dictionary with the distribution's name as key and the object of the distribution as value.


In [2]:
class DistType(Enum):
    continuous = 0
    discrete = 1
    
dist_types = OrderedDict([('continuous', DistType.continuous),
                          ('discrete', DistType.discrete)])

def get_dict_of_dists(dist_type):
    if dist_type is DistType.continuous:
        baseclass = stats.rv_continuous
    else:
        baseclass = stats.rv_discrete
    dists = [getattr(stats, d) for d in sorted(dir(stats)) if isinstance(getattr(stats,d), baseclass)]
    return OrderedDict([(dist.name, dist) for dist in dists])

dist_continuous = get_dict_of_dists(DistType.continuous)
dist_discrete = get_dict_of_dists(DistType.discrete)
print('number of continuous distributions:', len(dist_continuous))
print('number of discrete distributions:  ', len(dist_discrete))


number of continuous distributions: 89
number of discrete distributions:   13

Since a lot of distributions need additional shape parameters we use a nested defaultdict to define shape parameters as we go. For an undefined distribution DEFAULT_SHAPES will return 1.0 for all shape parameters.

The DEFAULT_SHAPES dictionary is not exhaustive, meaning that a lot of sane parameters still need to be configured. We access those default parameters with the help of the function default_shape_params.


In [3]:
def make_default_shape_dict():
    shape_param = defaultdict(lambda: 1.0)
    return defaultdict(lambda: shape_param)

def default_shape_params(dist):
    return OrderedDict([(p, DEFAULT_SHAPES[dist.name][p]) for p in shape_params(dist)])

DEFAULT_SHAPES = make_default_shape_dict()
DEFAULT_SHAPES['alpha'] = {'a': 1.3}
DEFAULT_SHAPES['beta'] = {'a': 1.5, 'b': 2.}

# discrete
DEFAULT_SHAPES['bernoulli'] = {'p': 0.7}
DEFAULT_SHAPES['binom'] = {'n': 10, 'p': 0.7}
DEFAULT_SHAPES['logser'] = {'p': 0.3}
DEFAULT_SHAPES['zipf'] = {'a': 2}
DEFAULT_SHAPES['randint'] = {'low': 0, 'high': 10}
DEFAULT_SHAPES['nbinom'] = {'n': 10, 'p': 0.6}
DEFAULT_SHAPES['hypergeom'] = {'n': 3, 'M': 10, 'N': 7}
DEFAULT_SHAPES['geom'] = {'p': 0.6}

Every project needs some purely auxiliary functions that help to keep the real program logic shorter and much more comprehensible. We define them in advance and all of them should be pretty much self-explanatory. Eventually, we have functions to:

  • flatten a list of list,
  • calculate the support of a distribution,
  • create patches, i.e. bars, from (x, y) data points,
  • determine the shape parameters of a distribution,
  • check if a distribution has shape parameters,
  • determine the options for a distribution selector widget,
  • determine the options for a function selector widget.

In [4]:
def flatten(lst):
    return [item for sublist in lst for item in sublist]

def support(dist, *shapeargs):
    # due to bug in scipy.levy_stable no keyword args for interval
    return dist.interval(1.0, *shapeargs)

def make_patches(x, y, width=0.8):
    m = width/2
    x = [[p-m, p-m, p+m, p+m] for p in x]
    y = [[0, p, p, 0] for p in y]
    return x, y

def shape_params(dist):
    if dist.shapes is not None:
        return dist.shapes.split(', ')
    
def has_shape_params(dist):
    return dist.numargs != 0

def dist_options(dist_type):
    if dist_type is DistType.continuous:
        return dist_continuous
    else:
        return dist_discrete

def func_options(dist_type):
    if dist_type is DistType.continuous:
        return ['pdf', 'cdf', 'ppf']
    else:
        return ['pmf', 'cdf']

The whole tool is basically about evaluating different functions, e.g. pdf, cdf, etc., of a distribution. So what we need to do is:

  1. determining the support of the function
  2. check if the distribution is continuous or discrete
  3. define a set of suitable x-values
  4. evaluate the given function on that set of x and return x and y

In [5]:
def get_dist_func_xy(dist, func, *shapeargs, **params):
    if func == 'ppf':
        interval = [0., 1.]
    else:
        interval = list(support(dist, *shapeargs))
    if dist in dist_continuous.values():
        for i, x in enumerate(interval):
            if np.isinf(x):
                interval[i] = np.sign(x)*100
            interval[i] += (-1)**i*1e-3
        l, r = interval
        x = np.linspace(l, r, 100*(r-l))
    elif dist in dist_discrete.values():
        for i, x in enumerate(interval):
            if np.isinf(x):
                interval[i] = np.sign(x)*20
        l, r = interval        
        x = np.arange(l+1, r+1)
    else:
        raise RuntimeError("Unknown distribution: {}".format(dist.name))
    y = getattr(dist, func)(x, *shapeargs, **params)
    return x, y

Here comes the heavy lifting. Later on, we will define selector widgets for the

  • type of distribution,
  • distribution itself,
  • function of the distribution,
  • parameters of the distribution (loc and scale for continuous distributions),
  • shape parameters of the distribution if they exist,

and therefore we need functions that steer the behaviour of the whole tool when one of the widgets changes its value. Because of that all functions start with the prefix update_ and basically wire all widgets together. For instance if currently the normal distribution is selected and you choose the distribution type discrete we need to also set the distribution selector to a discrete distribution which also triggers the function selector in order to choose a suitable function like pmf for a discrete distribution.


In [6]:
def update_type_sel():
    dist_sel.options = dist_options(type_sel.value)
    
def update_dist_sel():
    func_sel.options = func_options(type_sel.value)
    if has_shape_params(dist_sel.value):
        shapes = default_shape_params(dist_sel.value)
        text_inputs = [widgets.BoundedFloatText(value=v, description='{}:'.format(k)) for k, v in shapes.items()]
        [w.on_trait_change(update_dist_params, name='value') for w in text_inputs]
        shape_param_container.children = text_inputs
    else:
        shape_param_container.children = []
    if type_sel.value is DistType.continuous:
        param_container.children = [loc_slider, scale_slider]
    else:
        param_container.children = []
    update_dist_params()
    
def refresh_continuous(fig, data, *shapeargs):
    params = dict(loc=loc_slider.value, scale=scale_slider.value)
    data['x'], data['y'] = get_dist_func_xy(dist_sel.value, func_sel.value, *shapeargs, **params)
    fig.y_range.start = max(np.max(data['y']) - 5, 1.1*np.min(data['y']))
    fig.y_range.end = min(np.min(data['y']) + 5, 1.1*np.max(data['y']))
    offset, lim = 1e-1, 5
    fig.x_range.start = max(-lim, np.min(data['x']) - offset)
    fig.x_range.end = min(lim, np.max(data['x']) + offset)
    
def refresh_discrete(fig, data, *shapeargs):
    x, y = get_dist_func_xy(dist_sel.value, func_sel.value, *shapeargs)
    data['xs'], data['ys'] = make_patches(x, y)
    fig.y_range.start, fig.y_range.end = 0., 1.1*np.max(y)
    fig.x_range.start = max(-10, np.min(x) - 1)
    fig.x_range.end = min(10, np.max(x) + 1)
    
def update_dist_params():
    shapeargs = [c.value for c in shape_param_container.children]
    l_data['x'], l_data['y'] = [], []
    p_data['xs'], p_data['ys'] = [], []
    try:
        if type_sel.value is DistType.continuous:
            refresh_continuous(fig, l_data, *shapeargs)
        else:
            refresh_discrete(fig, p_data, *shapeargs)
    except Exception as e:
        error_text.value = "Invalid parameters! Choose again.<br>ERROR: {}".format(e)
        error_text.visible = True
    else:
        error_text.visible = False
    push_notebook()

To render the function values of the distribution we will use Bokeh which is a lot more appropriate for interactive visualisation than matplotlib and looks much nicer by default. Bokeh itself comes with a lot of widgets and you can do wonderful things like complete reporting web interfaces with them but at this point my design decision was to stick with Jupyter. Jupyter itself has ipywidgets and currently the advantage is that the Python community is just crazy about Jupyter meaning that there are also a lot of cool services like binder, tmpnb and many more. Since Bokeh widgets need a Bokeh server to be functional it is much easier right now to find a free service like binder that operates your notebook.

That being said, here we go. We are basically following Bokeh's Working in the Notebook tutorial and start by defining a figure as well as two glyphs, line for plotting continuous distributions and patches in order to plot discrete distributions. Currently, there is a limitation in the Jupyter/Bokeh interaction that allows you to only change the values in a plot of the last figure that was displayed. More important though is the fact that you change the last figure by changing the input sources (data_source) of the glyph's renderer and push them to the notebook via push_notebook(). For that reason we set up only one figure including glyphs for plotting continuous as well as discreet functions and return the figure as well as the data sources of both renderers.


In [7]:
def get_dist_fig_data():
    fig = figure(width=700, height=700, title=None, x_range=(-1, 1), y_range=(0, 1))
    ren_p = fig.patches([[]], [[]], line_width=3, alpha=0.3)
    ren_l = fig.line([], [], line_width=3)
    return fig, ren_l.data_source.data, ren_p.data_source.data

At this point we have everything set up and what's left to do is only to define the actual widgets and use the functionality defined before to wire them up. Currently I am quite dissatisfied with the fact that the update functions work a lot on our globally defined widgets. It would be way more explicit if the update function had parameters for all widgets they are working on. One solution could be to make use of the Functor Pattern with the help of functools.partial but I am not 100% convinced. Another possibility is put everything into one giant class which is also not my cup of tea. Drop me a line in the comments if you have a solution to that.


In [8]:
# Create widgets for selecting the type, distribution and function
type_sel = widgets.Dropdown(options=dist_types, description='type:')
dist_sel = widgets.Dropdown(options=dist_options(type_sel.value), description='dist:')
func_sel = widgets.Dropdown(options=func_options(type_sel.value), description='func:')

# Align the widgets in a horizontal box
dist_container = widgets.HBox()  
dist_container.children = [type_sel, dist_sel, func_sel]

# Wire the widgets to their corresponding update function
type_sel.on_trait_change(update_type_sel, name='value')
dist_sel.on_trait_change(update_dist_sel, name='value')
func_sel.on_trait_change(update_dist_params, name='value')

# Create widgets for parameter selection and boxes to align them
loc_slider = widgets.FloatSlider(value=0., min=-5.0, max=5.0, step=0.1, description='loc:')
scale_slider = widgets.FloatSlider(value=1., min=0.01, max=10.0, step=0.01, description='scale:')
loc_slider.on_trait_change(update_dist_params, name='value')
scale_slider.on_trait_change(update_dist_params, name='value')
param_container = widgets.VBox()
shape_param_container = widgets.HBox()
error_text = widgets.HTML()

# Display the widgets
display(dist_container)
display(param_container)
display(shape_param_container)
display(error_text)

# Generate the Bokeh plot and display
fig, l_data, p_data = get_dist_fig_data()
show(fig)

# Let's select the famous normal distribution for starters
type_sel.value = DistType.continuous
dist_sel.value = dist_continuous['norm']