In this notebook we are going to explain how the Gibbs Sampling Algorithm works through a very easy to understand example. But before we jump to the example, we are going to see some theory.
The Gibbs sampler is a method that helps us to obtain the probability distribution of two or more variables, when the direct calculation is very difficult. In other words, we are trying to calculate the probability of p(A,B), using a sampling method, because direct calculation is very difficult (for example when we have to calculate a very complicated integral).
The Gibbs sampler recreates the probabilty distribution p(A,B) by calculating p(A|B) and p(A|B) (assuming that both are very easy to calculate). It starts obtaining values for A and B, based on p(A|B) and p(B|A) until a sufficient number of values is reached. When all the values are obtained, then one calculate p(A=a,B=b) by counting the number of sample with A=a and B=b divided by the number of total samples.
The algorithm is the following:
Since the Gibbs sampler is a Markov Chain Monte Carlo technique, the samples obtained are not independent from the previuos sample obtained. In order to have a more accurate list of samples, a technique called thinning is used. This technique consists on taking just taking the samples every Nth cycle, for instance taking the samples every 100 cycles.
Another technique used is ignoring the first N sample, this is called the burn-in period.
To explain this algorithm we are going to use a cards example.
Let's say we have a deck of cards but some cards are missing. The total missing cards are:
3 faces in hearts
6 numbers in clubs
In total, 13 cards missing
To validate that our Gibbs sampler is working correctly, we are going to present the probability table
p(a=red) = 22/37 p(a=black) = 15/37
p(b=face) = 6/37 p(b=number) = 31/37
p(a=red, b=face) = 5/37 p(a=black, b=face) = 1/37 p(a=red, b=number) = 17/37 p(a=black, b=number) = 14/37
p(a=red | b=face) = 5/6 p(a=black | b=face) 1/6
p(a=red | b=number) = 17/31 p(a=black | b=number) = 14/31
p(b=face | a=red) = 5/22 p(b=number | a=red) = 17/22
p(b=face | a=black) = 1/15 p(b=number | a=black) = 14/15
The idea is then to calculate the values of p(A,B) just by using p(A|B) and p(B|A)
In [1]:
import random
__author__ = 'fpena'
colors = ['red', 'black']
symbols = ['faces', 'numbers']
probability_table = [
# Probability that color is red given that symbols is face (position 0) or
# symbol is number (position 1)
[5./6, 17./31],
# Probability that symbol is face given that color is red (position 0) or
# color is black (position 1)
[5./22, 1./15]
]
class GibbsSampler:
def __init__(self):
self.sample_list = []
def sample(self):
# Set the initial values for the variables
values = [
random.randint(0, 1), # Color value
random.randint(0, 1) # Symbol value
]
for i in xrange(3700000):
# Select which variable to take (colors or card types)
var_type = random.randint(0, 1)
if var_type == 0:
still_variable = 1
else:
still_variable = 0
new_random_value = random.random()
if new_random_value > probability_table[var_type][values[still_variable]]:
values[var_type] = 1
else:
values[var_type] = 0
if i % 100 == 0:
self.store_variables(values[0], values[1])
# self.store_variables(values[0], values[1])
print('Length', len(self.sample_list))
def store_variables(self, color, symbol):
self.sample_list.append((colors[color], symbols[symbol]))
# print_variables(color, symbol)
def count_results(self):
pairs = [
('red', 'faces'),
('red', 'numbers'),
('black', 'faces'),
('black', 'numbers')
]
for pair in pairs:
pair_count = 0
for sample in self.sample_list:
if sample == pair:
pair_count += 1
print(pair, pair_count)
def print_variables(color, symbol):
print(colors[color], symbols[symbol])
sampler = GibbsSampler()
sampler.sample()
sampler.count_results()
As you can see in the resutls above, the probability values obtained are very closed to the original ones (just divide the values between 1000 to obtain the expected number of cards).