Probability is amazing - Machine learning

Learning from data

Given a dataset and a model, we have these interesting probabilities:

  • The likelihood $P(Data|Model)$
  • Model fitting: $P(Model|Data)$
  • Predictive distribution: $P(New\ data | Model, Data)$

Together, these retell the machine learning story in probabilistic terms.
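
To make these three quantities concrete, here is a minimal sketch for a single coin, assuming a discrete grid of candidate biases $\theta$ and a uniform prior over them (the grid size and the prior are illustrative choices, not the only option):

In [ ]:
import numpy as np

# A toy dataset and a grid of candidate models theta = P(C=H);
# the grid and the uniform prior are illustrative assumptions.
data = ['H', 'H', 'T', 'H', 'T']
thetas = np.linspace(0.01, 0.99, 99)

n_heads = sum(1 for c in data if c == 'H')
n_tails = len(data) - n_heads

# Likelihood: P(Data | Model) for every candidate theta
likelihood = thetas**n_heads * (1 - thetas)**n_tails

# Model fitting: P(Model | Data) by Bayes' rule with a uniform prior over the grid
posterior = likelihood / likelihood.sum()

# Predictive distribution: P(New data = H | Model, Data), averaged over the posterior
p_next_head = (thetas * posterior).sum()

print("most likely theta:", thetas[np.argmax(likelihood)])
print("P(next flip = H | data):", p_next_head)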

The simplest learning algorithm you were never told about: counting!

Suppose we have a dataset of coin flips $D$, and we want to learn $P(C=H)$. Here is the simple algorithm:

  • Count the heads
  • Divide by the total trials

In [1]:
import scipy.stats as st

def generateData(true_p, dataset_size):
    # simulate dataset_size coin flips with P(C=H) = true_p
    return ['H' if i == 1 else 'T' for i in st.bernoulli.rvs(true_p, size=dataset_size)]

def estimate_p(data):
    # counting estimator: fraction of heads in the data
    return sum([1.0 if observation == 'H' else 0.0 for observation in data]) / len(data)

In [3]:
#simulate data
true_p = 0.3
dataset_size = 20000

data = generateData(true_p, dataset_size)
p_hat = estimate_p(data)

#print (data)
print ("true p:", true_p)
print ("learned p:", p_hat)


true p: 0.3
learned p: 0.30155

Exercise:

Plot the estimation accuracy (e.g., $|\hat{p} - p|$) vs. dataset size.


In [39]:
#Your code here !

The origin of the counting algorithm

First, we model the coin by a Bernoulli random variable $c \sim Ber(\theta)$.

Now the likelihood of the data is $$\text{likelihood} = P(Data|\theta) = \prod_i{\theta^{x_i} (1 - \theta)^{1 - x_i}}$$ where $x_i = 1$ for heads and $x_i = 0$ for tails.

And we want the $\theta$ that maximizes the likelihood: $$\theta_{mle} = \arg\max_{\theta}\prod_i{\theta^{x_i} (1 - \theta)^{1 - x_i}}$$

This $\theta$ is called the MLE, or maximum likelihood estimator. If you do the math (differentiate the log-likelihood and find its zeros), you will find that $$\theta_{mle} = \frac{\sum(C=H)}{N}$$ which is exactly the counting algorithm above.
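
A quick numeric sanity check (not a proof): maximize the log-likelihood over a grid of $\theta$ values and compare the result with the counting estimate. The grid search below is only an illustrative stand-in for the calculus.

In [ ]:
import numpy as np
import scipy.stats as st

# simulate flips (1 = heads) and scan a grid of candidate theta values
flips = st.bernoulli.rvs(0.3, size=5000)
thetas = np.linspace(0.001, 0.999, 999)

heads = flips.sum()
# log-likelihood of the whole dataset for each candidate theta
log_lik = heads * np.log(thetas) + (len(flips) - heads) * np.log(1 - thetas)

print("grid-search MLE  :", thetas[np.argmax(log_lik)])
print("counting estimate:", flips.mean())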

"All models are wrong, but some models are useful. — George Box""

In future iterations, the bias may change, or the coin may get bent, so why can we rely on the MLE estimator? The answer is subtle but fundamental to machine learning: a theorem called the No Free Lunch theorem states that we cannot learn without making assumptions.

Exercise:

What assumption did we make to make this work?

How does machine learning scale? (Independencies)

With $d$ coins, the full joint distribution has $2^d$ possible head/tail combinations, while an independence assumption needs only $d$ per-coin probabilities; the two estimators below make the difference concrete.


In [8]:
import itertools
import scipy.stats as st 
import numpy as np
import functools

def generateData(true_p, dataset_size):
    # simulate dataset_size rows of d = len(true_p) independent coin flips
    data = np.array([['H' if j == 1 else 'T' for j in st.bernoulli.rvs(true_p[i], size=dataset_size)]
                     for i in range(0, len(true_p))])
    return data.T.tolist()


def estimate_p(data):
    # joint estimator: count every one of the 2^d head/tail combinations
    d = len(data[0])
    if d > 23:
        raise ValueError('too many dims')
    toCross = [['H', 'T'] for i in range(0, d)]
    omega = list(itertools.product(*toCross))
    combos = {tuple(x): 0 for x in omega}

    for i in data:
        combos[tuple(i)] += 1
    n = len(data)
    p = {k: float(v)/n for (k, v) in combos.items()}
    return p

In [13]:
d = 12
pp = 0.25
data = generateData([pp for i in range(0, d)], 10000)

In [14]:
%%timeit -n 1 -r 1
p = estimate_p(data)

print ("true p: " , pp**d)
print (p[tuple(['H' for i in range(0, d)])])


true p:  5.960464477539063e-08
0.0
1 loops, best of 1: 7.82 ms per loop
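
The joint estimate comes out as exactly 0.0 because the all-heads pattern is almost never present in the sample, as a quick back-of-the-envelope check shows:

In [ ]:
# with p = 0.25 per coin and d = 12 coins, the all-heads row is expected
# to appear about n * p**d times in a dataset of n = 10000 rows
n, d = 10000, 12
print("expected all-heads rows in the data:", n * 0.25**d)   # ~0.0006, so usually none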

In [15]:
def estimate_p_ind(data):
    # independent estimator: one marginal probability per coin, 2*d entries in total
    d = len(data[0])
    omega = [(i, 'H') for i in range(0, d)] + [(i, 'T') for i in range(0, d)]
    combos = {x: 0 for x in omega}

    for i in data:
        for ix, j in enumerate(i):
            combos[(ix, j)] += 1
    n = len(data)
    p = {k: float(v)/n for (k, v) in combos.items()}
    return p

In [16]:
%%timeit -n 1 -r 1
p_ind = estimate_p_ind(data)
# under independence: P(H, H, ..., H) = prod_i P(coin i = H)
t = [p_ind[(i, 'H')] for i in range(0, d)]
p_Hs = functools.reduce(lambda x, y: x * y, t, 1)

print ("true p: " , pp**d)
print(p_Hs)


true p:  5.960464477539063e-08
5.9374260560847265e-08
1 loops, best of 1: 30.4 ms per loop
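
The factorized model recovers the tiny all-heads probability from the same 10,000 samples that left the joint table empty. A rough parameter-count comparison (illustrative only) shows why the joint approach cannot scale:

In [ ]:
# parameter counts for d binary variables: the full joint table needs 2**d - 1
# free probabilities, the independence assumption needs only d
for d in [5, 10, 12, 20, 30]:
    print(d, "coins -> joint:", 2**d - 1, " independent:", d)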

Probability is a database

The story so far

  • Probability is a database
  • The model is the schema that allows us to learn
  • We observe and insert (fit) a training set in bulk using optimization
  • We predict using inference, as sketched below
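
To close the analogy, here is a minimal sketch (with illustrative names, reusing the independence assumption from the previous section): the model is the schema, fitting is a bulk insert of counts, and prediction is a query answered by multiplying table entries.

In [ ]:
import functools
import scipy.stats as st

# Schema: d independent Bernoulli variables (this is the modelling assumption).
d, true_p = 4, [0.3, 0.5, 0.7, 0.2]

# "Bulk insert" (fitting): estimate the per-coin probabilities by counting.
rows = [['H' if st.bernoulli.rvs(p) == 1 else 'T' for p in true_p] for _ in range(5000)]
table = {(i, 'H'): sum(1 for r in rows if r[i] == 'H') / len(rows) for i in range(d)}
table.update({(i, 'T'): 1.0 - table[(i, 'H')] for i in range(d)})

# "Query" (inference): the probability of a full assignment is a product of lookups.
query = ('H', 'T', 'H', 'T')
answer = functools.reduce(lambda acc, iv: acc * table[(iv[0], iv[1])], enumerate(query), 1.0)
print("P", query, "~", answer)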