Input modeling


In [1]:
%load_ext rmagic

In the simulations so far, we've been generating training data by sampling from Ramscar et al.'s counts as if they were a multinomial distribution. That is, each number between 1 and 15 is selected with a probability proportional to the number of times Ramscar et al. found that number name mentioned in their corpus survey (e.g., they counted 455 mentions of one out of 998 total mentions of numbers between 1 and 15, so each training instance in our sample has cardinality 1 with probability 455/998 = 0.456).
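
For concreteness, here's a minimal sketch of that resampling scheme, assuming the numbers data frame from the ndl package (loaded in a cell below) with its Outcomes and Frequency columns:

In [ ]:
%%R
# Hypothetical sketch: draw 10,000 training cardinalities, each number 1-15
# chosen with probability proportional to its corpus frequency.
train <- sample(numbers$Outcomes, size=10000, replace=TRUE, prob=numbers$Frequency)
table(train)/length(train)  # should approximate Frequency/sum(Frequency)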

This isn't great, for lots of reasons. A big problem is that we never get sets with cardinality greater than 15. In real life, presumably, small counts are much more frequent than large counts, but even so, there are lots of possible large counts. The total probability mass of counts >15 could therefore be substantial, which in turn means we may be really skewing things. Also, since we're just using Ramscar et al.'s empirical distribution, it's hard to manipulate it in the ways Jeremy was recommending.

We also don't know whether the distribution of mentions of numbers is anything like the distribution of set sizes in the world. That is, there may be no relation between the frequency with which a language user encounters a set of three objects and the frequency with which language users say the word three. Yet we're using the latter as an estimate of the former. Is that justified?

So, to move forward, it'll be helpful to have a theoretically justifiable model for the distribution of set cardinalities. They're counts, so the obvious choice is a Poisson distribution. But there's a problem: because of the nature of the data, we've got no zero counts. The empty set never comes up in training. We could try a variant, though. The zero-truncated Poisson (ZTP) is a Poisson distribution conditioned on the count being positive, i.e., the Poisson pmf renormalized over the positive integers. Fitting a ZTP distribution in R is pretty easy. First get the data:


In [2]:
%%R
# Load Ramscar et al.'s number-frequency data from the ndl package.
library(ndl)
data(numbers)
# Convert the Outcomes column to integer cardinalities.
numbers$Outcomes <- type.convert(numbers$Outcomes)


This is ndl version 0.2.16. 
For an overview of the package, type 'help("ndl.package")'.
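
To make the truncation concrete: the ZTP pmf is just the Poisson pmf divided by the probability of a positive count. A minimal base-R sketch (the choice of lambda is arbitrary):

In [ ]:
%%R
# Zero-truncated Poisson pmf: Poisson pmf renormalized by P(X > 0).
dztp <- function(k, lambda) dpois(k, lambda)/(1 - dpois(0, lambda))
dztp(1:5, 2)          # pmf values at 1..5 for lambda = 2
sum(dztp(1:100, 2))   # sums to (essentially) 1 over the positive integers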

If the counts really follow a ZTP distribution, the variance should be no larger than the mean: for a plain Poisson the two are equal, and zero truncation only pulls the variance down. As a first step, we can check whether that relationship holds:


In [3]:
%R mean(rep(numbers$Outcomes,numbers$Frequency))


Out[3]:
array([ 2.67635271])

In [4]:
%R var(rep(numbers$Outcomes,numbers$Frequency))


Out[4]:
array([ 6.14489661])

Nope. The variance is more than twice the mean. This isn't too surprising: given how many different kinds of contexts are represented in the corpora Ramscar et al. surveyed, it's reasonable to expect the counts to be overdispersed. This suggests that the ZTP distribution won't be a good fit, but let's try it anyway:


In [5]:
%%R

library(VGAM)
# Intercept-only zero-truncated Poisson model, weighting each
# cardinality by its corpus frequency.
m1 <- vglm(Outcomes~1,family="pospoisson",weights=Frequency,data=numbers)
summary(m1)


Loading required package: splines
Loading required package: stats4

Attaching package: ‘VGAM’

The following object is masked from ‘package:stats4’:

    coef

The following objects are masked from ‘package:splines’:

    bs, ns

The following objects are masked from ‘package:stats’:

    case.names, coef, coefficients, df.residual, dfbeta, fitted,
    fitted.values, formula, hatvalues, poly, residuals, variable.names,
    weights

The following objects are masked from ‘package:base’:

    identity, scale.default


Call:
vglm(formula = Outcomes ~ 1, family = "pospoisson", data = numbers, 
    weights = Frequency)

Pearson residuals:
                Min     1Q Median     3Q    Max
log(lambda) -24.944 8.6047 11.462 14.908 20.436

Coefficients:
            Estimate Std. Error z value
(Intercept)  0.89364   0.022083  40.468

Number of linear predictors:  1 

Name of linear predictor: log(lambda) 

Dispersion Parameter for pospoisson family:   1

Log-likelihood: -2101.056 on 14 degrees of freedom

Number of iterations: 5 

A hanging rootogram is a diagnostic plot for comparing empirical counts with a theoretical distribution: the observed bars hang from the curve of expected counts on a square-root scale, so bars that fall short of or overshoot the zero line indicate a misfit:


In [6]:
%%R

library(vcd)
# Expected counts: the fitted ZTP pmf over 1:15, scaled by the 998 total mentions.
rootogram(numbers$Frequency,dpospois(1:15,exp(0.89364))*998)


Loading required package: grid

Not good at all. As an alternative, we can try the zero-truncated negative binomial (ZTNB) distribution. The negative binomial is a generalization of the Poisson with an additional parameter to allow for overdispersion (it's equivalent to a Poisson whose rate is itself gamma-distributed, which suits counts pooled over heterogeneous contexts):


In [7]:
%%R
# Same intercept-only fit, but with the zero-truncated negative binomial family.
m2 <- vglm(Outcomes~1,family="posnegbinomial",weights=Frequency,data=numbers)
summary(m2)


Call:
vglm(formula = Outcomes ~ 1, family = "posnegbinomial", data = numbers, 
    weights = Frequency)

Pearson residuals:
              Min      1Q  Median     3Q    Max
log(munb) -20.836  3.0551  6.5328 7.6187 9.3196
log(size) -10.147 -6.3514 -4.5322 1.2671 9.2883

Coefficients:
              Estimate Std. Error z value
(Intercept):1 -0.53464    0.38655 -1.3831
(Intercept):2 -1.82627    0.48355 -3.7768

Number of linear predictors:  2 

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -1740.027 on 28 degrees of freedom

Number of iterations: 6 

In [8]:
%%R
# Expected counts under the fitted ZTNB, scaled by the 998 total mentions.
rootogram(numbers$Frequency,dposnegbin(1:15,size=exp(-1.82625),munb=exp(-0.53464))*998)


Much better! This distribution is theoretically justifiable and is a good fit to the observed counts, so instead of re-sampling training data from the corpus counts, we can generate (hypothetical, simulated) training data by drawing random ZTNB variates. We can also see what the probability of getting a set with more than 15 elements in it is:


In [9]:
%%R
# Upper-tail probability: P(cardinality > 15) under the fitted ZTNB.
1-pposnegbin(15,munb=exp(-0.53481),size=exp(-1.82653))


[1] 0.004958314
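
So less than half a percent of training sets would have more than 15 elements. For the generation step itself, here's a sketch using VGAM's rposnegbin(), the random-variate companion to the dposnegbin() and pposnegbin() calls above (in some later VGAM releases these functions live in the VGAMdata package):

In [ ]:
%%R
# Draw 10,000 simulated training cardinalities from the fitted ZTNB,
# using the parameter estimates from m2 above.
set.seed(1)  # arbitrary seed, for reproducibility
train <- rposnegbin(10000, munb=exp(-0.53464), size=exp(-1.82627))
head(table(train))  # every draw is >= 1, heavily skewed toward small sets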

The choice of distribution for modeling the inputs seems solid. But what about the parameters? We've been using the data from the ndl package in R, which is allegedly from an unpublished paper by Ramscar et al. The numbers they report in the PNAS paper are slightly different. How well do those fit?

First we'll check the Spanish data from Table 1:


In [16]:
%%R

# Cardinalities 1-7 and their Spanish frequencies from Table 1 of the PNAS paper.
num <- 1:7
spanish <- c(1079,928,469,249,234,155,86)
ms <- vglm(num~1,weights=spanish,family=posnegbinomial())
summary(ms)


Call:
vglm(formula = num ~ 1, family = posnegbinomial(), weights = spanish)

Pearson residuals:
              Min        1Q  Median     3Q    Max
log(munb) -32.885   0.66876 15.4914 22.884 24.817
log(size) -25.524 -14.86999  1.6291 14.003 20.377

Coefficients:
              Estimate Std. Error z value
(Intercept):1  0.67067   0.022924  29.256
(Intercept):2  1.10129   0.107875  10.209

Number of linear predictors:  2 

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -5351.886 on 12 degrees of freedom

Number of iterations: 6 

In [18]:
%R rootogram(spanish,dposnegbin(1:7,size=exp(1.10129),munb=exp(0.67067))*sum(spanish))


Not too bad. Ramscar et al. describe this as an 'inverse power relation'. If frequency really fell off as a power of cardinality, we'd expect a straight line on a log-log plot of rank (here identical to cardinality, since the counts decrease monotonically) against frequency, since f ∝ r^-a implies log f is linear in log r. That's not what we see:


In [13]:
%R plot(1:7,spanish,log='xy',type='l')


Now let's check the English counts:


In [15]:
%%R

# English frequencies for cardinalities 1-7 from Table 1 of the PNAS paper.
english <- c(856,707,368,224,204,156,87)
me <- vglm(num~1,weights=english,family=posnegbinomial())
summary(me)


Call:
vglm(formula = num ~ 1, family = posnegbinomial(), weights = english)

Pearson residuals:
              Min       1Q  Median     3Q    Max
log(munb) -29.007 -0.54148 12.7489 19.550 21.972
log(size) -25.120 -7.43579  5.7971 11.691 16.939

Coefficients:
              Estimate Std. Error z value
(Intercept):1  0.72634   0.025027 29.0222
(Intercept):2  1.02552   0.103180  9.9392

Number of linear predictors:  2 

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -4500.582 on 12 degrees of freedom

Number of iterations: 5 

In [17]:
%R rootogram(english,dposnegbin(1:7,size=exp(1.02552),munb=exp(0.72634))*sum(english))


Again, not too bad, but not perfect either. It's suspicious that the model overestimates the frequencies for 3 and 4 and underestimates 5, 6, and 7 in both Spanish and English; systematic deviations like that make me suspect the model isn't such a perfect fit after all (once again, reality trumps statistics!). But it's still pretty good, with very similar results for the two languages, and very different results (at least in terms of the parameter estimates) from the ndl dataset we've been using.
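
To see just how different, here's a sketch comparing the fitted parameters on the natural scale by exponentiating each model's intercepts (coef() on a vglm object returns the two linear-predictor intercepts, log(munb) and log(size)):

In [ ]:
%%R
# Fitted ZTNB parameters, natural scale, for the three datasets.
params <- rbind(ndl=exp(coef(m2)), spanish=exp(coef(ms)), english=exp(coef(me)))
colnames(params) <- c("munb", "size")
params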

