Conditional Probabilities

George Tzanetakis, University of Victoria

In this notebook we explore conditional probabilities using an example from music. Imagine that you have a collection of songs consisting of two genres: country and jazz. Some songs have lyrics and some do not, i.e., they are instrumental. It makes sense that the probability of a song being instrumental depends on whether the song is jazz or country. This dependence can be modeled through conditional probabilities.

Helper random variable class

Define a helper random variable class, based on the scipy discrete random variable functionality, that supports both numeric and symbolic RVs.


In [1]:
%matplotlib inline 
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np 

class Random_Variable: 
    
    def __init__(self, name, values, probability_distribution): 
        self.name = name 
        self.values = values 
        self.probability_distribution = probability_distribution 
        # accept plain Python ints as well as any NumPy integer type
        if all(isinstance(item, (int, np.integer)) for item in values): 
            self.type = 'numeric'
            self.rv = stats.rv_discrete(name = name, values = (values, probability_distribution))
        elif all(type(item) is str for item in values): 
            self.type = 'symbolic'
            self.rv = stats.rv_discrete(name = name, values = (np.arange(len(values)), probability_distribution))
            self.symbolic_values = values 
        else: 
            self.type = 'undefined'
    
    def sample(self,size): 
        if (self.type =='numeric'): 
            return self.rv.rvs(size=size)
        elif (self.type == 'symbolic'): 
            numeric_samples = self.rv.rvs(size=size)
            mapped_samples = [self.values[x] for x in numeric_samples]
            return mapped_samples

We can simulate the generative process behind conditional probabilities by appropriately sampling from three random variables.


In [9]:
# samples to generate 
num_samples = 1000

## Prior probabilities of a song being jazz or country 
values = ['country', 'jazz']
probs = [0.7, 0.3]
genre = Random_Variable('genre',values, probs)

# conditional probabilities of a song having lyrics or not given the genre 
values = ['no', 'yes']
probs = [0.9, 0.1] 
lyrics_if_jazz = Random_Variable('lyrics_if_jazz', values, probs)

values = ['no', 'yes']
probs = [0.2, 0.8]
lyrics_if_country = Random_Variable('lyrics_if_country', values, probs)

# conditional generating process: first sample the prior and then, based on the outcome, 
# choose which conditional probability distribution to use 

random_lyrics_samples = [] 
for n in range(num_samples): 
    # the 1 below is to get one sample and the 0 to get the first item of the list of samples 
    random_genre_sample = genre.sample(1)[0]

    # depending on the outcome of the genre sampling, sample from the appropriate 
    # conditional probability distribution 
    if (random_genre_sample == 'jazz'): 
        random_lyrics_sample = (lyrics_if_jazz.sample(1)[0], 'jazz')
    else: 
        random_lyrics_sample = (lyrics_if_country.sample(1)[0], 'country')
    random_lyrics_samples.append(random_lyrics_sample)

# output 1 item per line and output the first 20 samples 
for s in random_lyrics_samples[0:20]: 
    print(s)


('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'country')
('no', 'country')
('no', 'jazz')
('no', 'jazz')
('yes', 'country')
('yes', 'country')
('yes', 'country')
('yes', 'country')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('yes', 'country')
('yes', 'jazz')
('yes', 'country')
('yes', 'country')
('yes', 'country')

Notice that we have generated samples of whether the song has lyrics or not. Above I have also printed the associated genre label. In many probabilistic modeling problems some information is not available to the observer. For example, we could be given only the yes/no outcomes, and the genres could be "hidden".
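
To make the "hidden" case concrete, here is a minimal sketch of what such an observer would see. It does not use the `Random_Variable` class; the helper `sample_song`, the seed, and the sample count are assumptions made only for this illustration, while the probabilities are the ones defined above. With the genre dropped, the only thing that can be estimated directly is the marginal probability of a song having no lyrics, which mixes the two genres.

```python
import random

random.seed(1)  # arbitrary seed, chosen only to make the sketch reproducible

def sample_song():
    # hypothetical helper: sample the genre first, then lyrics given the genre
    genre = 'jazz' if random.random() < 0.3 else 'country'
    p_no = 0.9 if genre == 'jazz' else 0.2
    lyrics = 'no' if random.random() < p_no else 'yes'
    return lyrics, genre

full_samples = [sample_song() for _ in range(20000)]

# the observer only sees the yes/no outcomes; the genre labels are hidden
observed = [lyrics for lyrics, _ in full_samples]

# estimated marginal P(hasLyrics = no) versus the value from the model
est_p_no = observed.count('no') / len(observed)
exact_p_no = 0.3 * 0.9 + 0.7 * 0.2
print(round(est_p_no, 3), round(exact_p_no, 3))
```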

Now let's use these generated samples to estimate probabilities of the model. Basically we pretend that we don't know the parameters and estimate them directly by frequency counting through the samples we generated.


In [10]:
# First only consider jazz samples 
jazz_samples = [x for x in random_lyrics_samples if x[1] == 'jazz']
for s in jazz_samples[0:20]: 
    print(s)


('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('yes', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')
('no', 'jazz')

Now that we have selected the samples that are jazz, we can simply count the "yes" and "no" lyrics entries and divide each count by the total number of jazz samples to get estimates of the conditional probabilities. Think about the relationships: we can use the data to estimate the parameters of a model (learning), we can use the model to generate samples (generation), and we can use the model to calculate probabilities for various events (inference).


In [12]:
est_no_if_jazz = len([x for x in jazz_samples if x[0] == 'no']) / len(jazz_samples)
est_yes_if_jazz = len([x for x in jazz_samples if x[0] == 'yes']) / len(jazz_samples)
print(est_no_if_jazz, est_yes_if_jazz)


0.8867924528301887 0.11320754716981132
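
The same frequency counting extends to every parameter of the model, not just the jazz conditionals. The sketch below is one possible way to do this; the function name `estimate_parameters` and the ten toy samples are made up for illustration and are not output from the cells above.

```python
from collections import Counter

def estimate_parameters(samples):
    # samples is a list of (lyrics, genre) tuples, as in the notebook
    genre_counts = Counter(genre for _, genre in samples)
    pair_counts = Counter(samples)
    # prior P(genre): fraction of samples with that genre
    prior = {g: c / len(samples) for g, c in genre_counts.items()}
    # conditional P(lyrics | genre): pair count divided by the genre count
    conditional = {(lyrics, g): c / genre_counts[g]
                   for (lyrics, g), c in pair_counts.items()}
    return prior, conditional

# toy data, invented for this example only
toy_samples = [('no', 'jazz'), ('no', 'jazz'), ('yes', 'jazz'), ('no', 'jazz'),
               ('yes', 'country'), ('yes', 'country'), ('yes', 'country'),
               ('no', 'country'), ('yes', 'country'), ('yes', 'country')]

prior, conditional = estimate_parameters(toy_samples)
print(prior)
print(conditional[('no', 'jazz')], conditional[('yes', 'country')])
```

Passing `random_lyrics_samples` instead of the toy list should give estimates for the prior and for both conditional distributions in one call.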

We have seen in the slides that the probability of a song being jazz, given that it is instrumental, is approximately 0.66:
\begin{equation} P(\text{genre} = \text{jazz} \mid \text{hasLyrics} = \text{no}) = \frac{0.3 \times 0.9}{0.3 \times 0.9 + 0.7 \times 0.2} = \frac{0.27}{0.41} \approx 0.66 \end{equation}
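
The same calculation can be written as a few lines of Python. This is a small sketch using the model parameters defined earlier; the dictionary names are chosen here for illustration. The denominator is the total probability of a song having no lyrics, and the posterior follows from Bayes' rule.

```python
# model parameters from the notebook
p_genre = {'country': 0.7, 'jazz': 0.3}
p_no_lyrics_given = {'country': 0.2, 'jazz': 0.9}

# law of total probability: P(hasLyrics = no), summed over both genres
p_no = sum(p_genre[g] * p_no_lyrics_given[g] for g in p_genre)

# Bayes' rule: P(genre = jazz | hasLyrics = no)
p_jazz_given_no = p_genre['jazz'] * p_no_lyrics_given['jazz'] / p_no

print(round(p_no, 4), round(p_jazz_given_no, 4))  # 0.41 0.6585
```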

This calculation is based on our knowledge of the model's probabilities. If we have some data, we can also estimate this probability directly from it. This is called approximate inference, in contrast to the exact inference that gives $0.66$. When problems become complicated, exact inference can become too costly to compute, while approximate inference can provide reasonable answers much faster. We will see that later when examining probabilistic graphical models. In this case the exact and approximate inference estimates are relatively close.


In [14]:
no_samples = [x for x in random_lyrics_samples if x[0] == 'no']
est_jazz_if_no_lyrics = len([x for x in no_samples if x[1] == 'jazz']) / len(no_samples)
print(est_jazz_if_no_lyrics)


0.6828087167070218
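
How close the approximate estimate gets to the exact value depends on the number of samples. The sketch below repeats the experiment for several sample sizes. It uses NumPy's random generator directly rather than the `Random_Variable` class; the function name, the seed, and the chosen sample sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for reproducibility

def estimate_jazz_given_no_lyrics(num_samples):
    # sample the genre first: True means jazz (prior 0.3), False means country
    is_jazz = rng.random(num_samples) < 0.3
    # probability of no lyrics depends on the sampled genre
    p_no = np.where(is_jazz, 0.9, 0.2)
    no_lyrics = rng.random(num_samples) < p_no
    # fraction of jazz songs among the songs without lyrics
    return is_jazz[no_lyrics].mean()

exact = 0.3 * 0.9 / (0.3 * 0.9 + 0.7 * 0.2)
for n in [100, 1000, 10000, 100000]:
    est = estimate_jazz_given_no_lyrics(n)
    print(n, round(est, 4), round(abs(est - exact), 4))
```

The error fluctuates from run to run, but it tends to shrink as the number of samples grows.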
