DM wurwur debugging


In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
from random import *
from dmww_classes import *
from sampling_helper import *


seed(1) # for debugging

Model with toy corpus

Each wurwur (words world model) instance depends on a world, a corpus, and a lexicon - learned by the model - that connects the two.


In [2]:
w = World(n_words=8,n_objs=8)
w.show()
            
c = Corpus(world=w, n_sents=40, n_per_sent=2)
#c.show()


n_objs = 8
n_words = 8

Check the lexicon counts via co-occurrence:


In [4]:
l.learn_lex(c)
l.plot_lex(w)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-4c4531bc17b3> in <module>()
----> 1 l.learn_lex(c)
      2 l.plot_lex(w)

NameError: name 'l' is not defined
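The failure above is just an ordering bug in this session: no lexicon was ever instantiated, so the name l does not exist yet. A minimal fix, mirroring the CoocLexicon usage in In [21] below:

l = CoocLexicon(w)   # co-occurrence lexicon defined over the toy world
l.learn_lex(c)       # accumulate object-word co-occurrence counts from the corpus
l.plot_lex(w)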

Now try this with the Gibbs sampler (the actual model):


In [5]:
p = Params(n_samps=20,
           alpha_r=.1,
           alpha_nr=10,
           empty_intent=.0001,
           n_hypermoves=5)

l = GibbsLexicon(c,p,verbose=0,hyper_inf=True)
l.learn_lex(c, p)
l.ref
l.plot_scores()
l.plot_lex(w)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-2e0126c1cb13> in <module>()
      5            n_hypermoves=5)
      6 
----> 7 l = GibbsLexicon(c,p,verbose=0,hyper_inf=True)
      8 l.learn_lex(c, p)
      9 l.ref

NameError: name 'GibbsLexicon' is not defined

(The identical call succeeds in In [22] below, so GibbsLexicon was presumably added to or fixed in dmww_classes after this cell ran and picked up via autoreload.)

Model with mini "real" corpus


In [20]:
w = World(corpus = 'corpora/corpus_toy.csv')
w.show()

c = Corpus(world=w, corpus = 'corpora/corpus_toy.csv')
c.show()


n_objs = 6
n_words = 14
o: [3 5 0 2 4] w: [ 3  4  2 11  5 10  7]
o: [3 5 0 2 4] w: [13  0  4  1  9  8]
o: [3 5 0 2 1 4] w: [ 3  2 12  6]

(Each line is one sentence from the corpus: o gives the indices of the objects present in the scene, w the indices of the words in the utterance.)

In [21]:
l = CoocLexicon(w)
l.learn_lex(c)
l.show()


[[ 1.  1.  2.  2.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  0.  1.  1.  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.]
 [ 1.  1.  2.  2.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  2.  2.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  2.  2.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  2.  2.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.]]

In [22]:
l = GibbsLexicon(c, p,
                 verbose=0,
                 hyper_inf=True)

l.learn_lex(c, p)
l.plot_lex(w, certainwords = 0)
l.params.show()



...................
 *** average sample time: 0.027 sec
intent_hp_b: 1.0000
intent_hp_a: 1.0000
empty_intent: 0.0001
alpha_nr_hp: 2.0000
alpha_r_hp: 1.0000
n_samps: 20.0000
alpha_nr: 10.0000
alpha_r: 0.1000
n_hypermoves: 5.0000

Model with full-scale real corpus


In [9]:
corpusfile = 'corpora/corpus.csv'
w = World(corpus=corpusfile)
w.show()

c = Corpus(world=w, corpus=corpusfile)


n_objs = 23
n_words = 420

In [10]:
p = Params(n_samps=100,
           alpha_r=.1,
           alpha_nr=10,
           empty_intent=.0001,
           n_hypermoves=5)

l = GibbsLexicon(c, p,
                 verbose=0,
                 hyper_inf=True)

l.learn_lex(c,p)



...............................................................................

...................
 *** average sample time: 1.667 sec

In [15]:
l.plot_lex(w)



In [11]:
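# For each object, report the word with the largest value in the learned lexicon matrix l.ref.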
for o in range(w.n_objs):
    wd = where(l.ref[o,:] == max(l.ref[o,:]))
    print "o: %s, w: %s" % (w.objs_dict[o][0], w.words_dict[wd[0][0]][0])


o: rattle, w: yeah
o: mirror, w: david
o: pig, w: pig
o: duck, w: bird
o: ring, w: ring
o: girl, w: and
o: lamb, w: lamb
o: book, w: book
o: hat, w: hat
o: bunny, w: bunnyrabbit
o: sheep, w: sheep
o: eyes, w: look
o: woman, w: mommy
o: kitty, w: kittycat
o: bear, w: bottle
o: hand, w: yeah
o: baby, w: meow
o: dax, w: dax
o: man, w: daddy
o: boy, w: another
o: cow, w: moocow
o: face, w: all
o: bird, w: bigbird

Gold standard comparison

Note that the F-score here is assessed via ANY non-zero counts.

We should think about the best way to do this for corpora; a rough sketch of this scoring rule appears after the cell below.


In [12]:
corpusfile = 'corpora/gold_standard.csv'
c_gs = Corpus(world = w, corpus = corpusfile)
get_f(l.ref, c_gs)


Out[12]:
(0.08045977011494253, 0.6176470588235294, 0.14237288135593221)
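For reference, the returned triple is consistent with (precision, recall, F-score): 2 * 0.080 * 0.618 / (0.080 + 0.618) is roughly 0.142. Below is a rough sketch of the scoring rule described above, where any non-zero count in the learned lexicon counts as a proposed object-word link; this is only an illustration, not the actual get_f implementation, and the gold-standard pairs argument is an assumed input format.

import numpy as np

def f_score_nonzero(ref, gold_pairs):
    # ref: n_objs x n_words count matrix learned by the model (e.g. l.ref)
    # gold_pairs: set of (object_index, word_index) pairs from the gold standard
    predicted = set(zip(*np.nonzero(ref)))  # any non-zero count is treated as a link
    tp = len(predicted & gold_pairs)
    precision = tp / float(len(predicted)) if predicted else 0.0
    recall = tp / float(len(gold_pairs)) if gold_pairs else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f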