In this notebook we explore conditional probabilities using an example from music. Imagine that you have a collection of songs consisting of two genres: country and jazz. Some songs have lyrics and some have not i.e they are instrumental. It makes sense that the probability of a song being instrumental depends on whether the song is jazz or country. This can be modeled through conditional probabilities.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np
class Random_Variable:
def __init__(self, name, values, probability_distribution):
self.name = name
self.values = values
self.probability_distribution = probability_distribution
if all(type(item) is np.int64 for item in values):
self.type = 'numeric'
self.rv = stats.rv_discrete(name = name, values = (values, probability_distribution))
elif all(type(item) is str for item in values):
self.type = 'symbolic'
self.rv = stats.rv_discrete(name = name, values = (np.arange(len(values)), probability_distribution))
self.symbolic_values = values
else:
self.type = 'undefined'
def sample(self,size):
if (self.type =='numeric'):
return self.rv.rvs(size=size)
elif (self.type == 'symbolic'):
numeric_samples = self.rv.rvs(size=size)
mapped_samples = [self.values[x] for x in numeric_samples]
return mapped_samples
We can simulate the generation process of conditianal probabilities by appropriately sampling from three random variables.
In [9]:
# samples to generate
num_samples = 1000
## Prior probabilities of a song being jazz or country
values = ['country', 'jazz']
probs = [0.7, 0.3]
genre = Random_Variable('genre',values, probs)
# conditional probabilities of a song having lyrics or not given the genre
values = ['no', 'yes']
probs = [0.9, 0.1]
lyrics_if_jazz = Random_Variable('lyrics_if_jazz', values, probs)
values = ['no', 'yes']
probs = [0.2, 0.8]
lyrics_if_country = Random_Variable('lyrics_if_country', values, probs)
# conditional generating proces first sample prior and then based on outcome
# choose which conditional probability distribution to use
random_lyrics_samples = []
for n in range(num_samples):
# the 1 below is to get one sample and the 0 to get the first item of the list of samples
random_genre_sample = genre.sample(1)[0]
# depending on the outcome of the genre sampling sample the appropriate
# conditional probability
if (random_genre_sample == 'jazz'):
random_lyrics_sample = (lyrics_if_jazz.sample(1)[0], 'jazz')
else:
random_lyrics_sample = (lyrics_if_country.sample(1)[0], 'country')
random_lyrics_samples.append(random_lyrics_sample)
# output 1 item per line and output the first 20 samples
for s in random_lyrics_samples[0:20]:
print(s)
Notice that we have generated samples of whether the song has lyrics or not. Above I have also printed the associated genre label. In many probabilistic modeling problems some information is not available to the observer. For example we could be provided only the yes/no outcomes and the genres could be "hidden".
Now let's use these generated samples to estimate probabilities of the model. Basically we pretend that we don't know the parameters and estimate them directly by frequency counting through the samples we generated.
In [10]:
# First only consider jazz samples
jazz_samples = [x for x in random_lyrics_samples if x[1] == 'jazz']
for s in jazz_samples[0:20]:
print(s)
Now that we have selected the samples that are jazz we can simply count the lyrics yes and lyrics no entries and divide them by the total number of jazz samples to get estimates of the conditional probabilities. Think about the relationships: we can use the data to estimate the parameters of a model (learning), we can use the model to generate samples (generation), and we can use the model to calculate probabilities for various events (inference).
In [12]:
est_no_if_jazz = len([x for x in jazz_samples if x[0] == 'no']) / len(jazz_samples)
est_yes_if_jazz = len([x for x in jazz_samples if x[0] == 'yes']) / len(jazz_samples)
print(est_no_if_jazz, est_yes_if_jazz)
We have seen in the slides that the probability of a song being jazz if we know that it is instrumental is 0.66. \begin{equation} P(genre = jazz | hasLyrics = no) = \frac{0.3 * 0.9}{0.3 * 0.9 + 0.7 * 0.2} = 0.66 \end{equation}
This is based on our knowledge of probabilities. If we have some data we can also estimate this probability directly. This is called approximate inference in contrast to the exact inference of $0.66$. When problems become complicated exact inference can become too costly to compute while approximate inference can provide reasonable answers much faster. We will see that later when examining probabilistic graphical models. As you can see in this case both the exact and approximate inference probability estimates are relatively close.
In [14]:
no_samples = [x for x in random_lyrics_samples if x[0] == 'no']
est_jazz_if_no_lyrics = len([x for x in no_samples if x[1] == 'jazz']) / len(no_samples)
print(est_jazz_if_no_lyrics)
In [ ]: