Iris univariate joint probability distribution

Again, it's the Iris dataset (I promise I will move on to some 'real' datasets at some point). I've already done a lot of bivariate cluster plots, so this time I wanted to plot 1D probability distributions based on a custom query.

In this instance, that means plotting the distribution of each variable on its own, conditioned on the class, e.g. P(sepal_length|iris_class), P(petal_length|iris_class), and so on.
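
As a point of reference for what a quantity like P(sepal_length|iris_class) describes, the per-class mean and variance of a single variable can also be estimated directly from the data, without any model. The sketch below is a minimal illustration in plain Python. The helper `class_conditional_moments` and the handful of measurement values are made up for the example; they are not taken from iris.csv or from bayesianpy.

```python
from collections import defaultdict
from statistics import mean, variance


def class_conditional_moments(values, labels):
    """Return {label: (mean, sample variance)} for one variable, grouped by class."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    return {label: (mean(vals), variance(vals)) for label, vals in grouped.items()}


# Illustrative values only (hypothetical, not the real Iris measurements).
sepal_length = [5.0, 5.2, 4.8, 6.0, 6.4, 5.6]
iris_class = ["setosa"] * 3 + ["versicolor"] * 3

moments = class_conditional_moments(sepal_length, iris_class)
for label, (mu, var) in moments.items():
    print(f"P(sepal_length | {label}): mean={mu:.2f}, variance={var:.2f}")
```

These empirical moments are only a sanity check to compare against; the rest of the notebook obtains the distributions by querying a trained network instead.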


In [1]:
%matplotlib inline
import pandas as pd
import sys
sys.path.append("../../../bayesianpy")
import bayesianpy
from bayesianpy.network import Builder as builder

import logging
import os
import math
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import seaborn as sns

logger = logging.getLogger()
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

bayesianpy.jni.attach(logger)

db_folder = bayesianpy.utils.get_path_to_parent_dir("")
iris = pd.read_csv(os.path.join(db_folder, "data/iris.csv"), index_col=False)

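Create the network, specifying a latent variable. The latent cluster variable has four states and is the parent of both a multivariate continuous node over the four measurements (named "joint") and the discrete iris_class variable.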


In [2]:
network = bayesianpy.network.create_network()
cluster = builder.create_cluster_variable(network, 4)
node = builder.create_multivariate_continuous_node(network, iris.drop('iris_class',axis=1).columns.tolist(), "joint")
builder.create_link(network, cluster, node)

class_variable = builder.create_discrete_variable(network, iris, 'iris_class', iris['iris_class'].unique())
builder.create_link(network, cluster, class_variable)
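
In the network above, the class is connected to the measurements only through the latent cluster variable, so the density of one measurement given a class is a weighted sum of per-cluster Gaussians: P(x|class) = sum over k of P(cluster=k|class) * N(x; mean_k, variance_k). The following sketch evaluates such a mixture, along with its overall mean and variance, in plain Python. It only illustrates the maths implied by the network structure; the function names and numbers are hypothetical, and this is not a description of how bayesianpy computes its results internally.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Density of a Gaussian mixture: sum_k w_k * N(x; mu_k, var_k)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def mixture_moments(weights, means, variances):
    """Overall mean and variance of the mixture (law of total variance)."""
    mu = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (v + (m - mu) ** 2)
              for w, m, v in zip(weights, means, variances))
    return mu, var


# Hypothetical two-cluster example: P(cluster=k | class) and per-cluster parameters.
weights = [0.5, 0.5]
means = [0.0, 2.0]
variances = [1.0, 1.0]

mu, var = mixture_moments(weights, means, variances)
density_at_mean = mixture_pdf(mu, weights, means, variances)
print(mu, var, density_at_mean)
```

Note that the mixture variance is larger than either cluster's variance, because the spread between the cluster means contributes as well.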

Finally, query the model, specifying each variable in a separate query (otherwise the query will return a covariance matrix).
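
For context on the covariance matrix remark: for a multivariate Gaussian, the distribution of a single variable on its own is a univariate Gaussian whose variance is the matching diagonal entry of the covariance matrix. The sketch below shows that relationship with a made-up 2x2 covariance. `marginal` is a hypothetical helper and the numbers are illustrative, not output from the model.

```python
def marginal(mean_vector, covariance, index):
    """Mean and variance of one variable of a multivariate Gaussian."""
    return mean_vector[index], covariance[index][index]


# Hypothetical joint parameters for two variables (illustrative values only).
names = ["petal_length", "petal_width"]
mean_vector = [4.0, 1.5]
covariance = [[0.25, 0.10],
              [0.10, 0.09]]

# One (mean, variance) pair per variable, ignoring the off-diagonal covariances.
per_variable = {name: marginal(mean_vector, covariance, i)
                for i, name in enumerate(names)}
print(per_variable)
```

A one-variable-per-query approach yields this kind of (mean, variance) pair directly, which is the shape a 1D plot needs.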


In [3]:
head_variables = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

with bayesianpy.data.DataSet(iris, db_folder, logger) as dataset:
    model = bayesianpy.model.NetworkModel(network, logger)
    model.train(dataset)

    queries = [bayesianpy.model.QueryConditionalJointProbability(
                   head_variables=[v],
                   tail_variables=['iris_class']) for v in head_variables]

    (engine, _, _) = bayesianpy.model.InferenceEngine(network).create()
    query = bayesianpy.model.SingleQuery(network, engine, logger)
    results = query.query(queries)
    jd = bayesianpy.visual.JointDistribution()
    fig = plt.figure(figsize=(10,10))

    for i, r in enumerate(list(results)):
        ax = fig.add_subplot(2, 2, i+1)
        jd.plot_distribution_with_variance(ax, iris, queries[i].get_head_variables(), r)

    plt.show()


INFO:root:Writing 150 rows to storage
Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
Finished writing 150 rows to storage
INFO:root:Training model...
Training model...
INFO:root:Finished training model
Finished training model
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:12: DeprecationWarning: Call to deprecated class SingleQuery (Use 'Query' instead.).
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:13: DeprecationWarning: Call to deprecated function or method query (Use 'execute' instead).