In [1]:
%load_ext rmagic
In the simulations so far, we've been generating training data by sampling from Ramscar et al.'s counts as if they were a multinomial distribution. That is, each number between 1 and 15 is selected with probability proportional to the number of times Ramscar et al. found that number name mentioned in their corpus survey (e.g., they counted 455 mentions of one out of 998 total mentions of numbers between 1 and 15, so each training instance in our sample has cardinality 1 with probability 455/998 = 0.456).
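Concretely, that resampling scheme amounts to a single call to `sample()`, drawing cardinalities in proportion to the corpus frequencies. A minimal sketch, assuming the `ndl` package's `numbers` dataset that we load below:

```r
# Resample training cardinalities from the empirical corpus counts,
# treating the relative frequencies as multinomial probabilities.
library(ndl)
data(numbers)
card <- type.convert(numbers$Outcomes)          # number names -> integers 1..15
probs <- numbers$Frequency / sum(numbers$Frequency)
set.seed(1)                                     # for reproducibility
train <- sample(card, size = 1000, replace = TRUE, prob = probs)
table(train)                                    # sampled counts track the corpus counts
```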
This isn't great, for lots of reasons. A big problem is that we never get sets with cardinality greater than 15. In real life, presumably, small counts are much more frequent than large counts, but even so, there are lots of possible large counts. Thus, the total probability mass of counts greater than 15 could be pretty large, which in turn means we may be really skewing things. Also, since we're just using Ramscar et al.'s empirical distribution, it's hard to manipulate it in the ways Jeremy was recommending.
We also don't know whether the distribution of mentions of number is anything like the distribution of numbers. That is, there may be no relation between the frequency with which a language user encounters a set of three objects and the frequency with which language users say the word three. Yet we're using the latter as an estimate for the former. Is that justified?
So, to move forward, it'll be helpful to have a theoretically justifiable model for the distribution of set cardinalities. They're counts, so the obvious choice is a Poisson distribution. But there's a problem: because of the nature of the data, we've got no zero counts. The empty set never comes up in training. We could try a variant, though. The zero-truncated Poisson (ZTP) is like a Poisson distribution, but with no zero counts. Fitting a ZTP distribution in R is pretty easy. First get the data:
In [2]:
%%R
library(ndl)
data(numbers)
numbers$Outcomes <- type.convert(numbers$Outcomes)
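For reference, the ZTP is just the Poisson pmf renormalized over the positive integers: P(X = k) = e^(-λ) λ^k / (k! (1 − e^(-λ))). A quick base-R sketch (equivalent to VGAM's `dpospois()`), using λ ≈ 2.44, roughly the rate the model below ends up estimating:

```r
# Zero-truncated Poisson pmf: the ordinary Poisson pmf renormalized to
# exclude zero (equivalent to VGAM's dpospois()).
dztp <- function(k, lambda) dpois(k, lambda) / (1 - dpois(0, lambda))

lambda <- 2.44
sum(dztp(1:200, lambda))                # sums to ~1 over k >= 1
sum((1:200) * dztp(1:200, lambda))      # ZTP mean by brute force
lambda / (1 - exp(-lambda))             # closed-form mean; should match
```

Note that truncation pushes the mean above λ, since the zeros are redistributed over the positive counts.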
For a Poisson distribution, the mean equals the variance (zero-truncation shifts both a bit, but near-equality is still a useful first check). So, as a first step, we can check whether that relationship roughly holds:
In [3]:
%R mean(rep(numbers$Outcomes,numbers$Frequency))
Out[3]:
In [4]:
%R var(rep(numbers$Outcomes,numbers$Frequency))
Out[4]:
Nope. The variance is much too high. This isn't too surprising. Given how many different kinds of contexts are represented in the corpora that Ramscar looked at, it's reasonable to expect the counts to be overdispersed. This suggests that the ZTP distribution won't be a good fit, but let's try it anyway:
In [5]:
%%R
library(VGAM)
m1 <- vglm(Outcomes~1,family="pospoisson",weights=Frequency,data=numbers)
summary(m1)
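One thing to keep in mind when reading the `vglm()` output: the intercept is reported on the log scale, so the fitted Poisson rate is its exponential (which is where the `exp(0.89364)` in the rootogram call below comes from):

```r
# The vglm intercept for the pospoisson family is log(lambda);
# exponentiate to recover the fitted rate. 0.89364 is the intercept
# reported by summary(m1) above.
exp(0.89364)  # fitted (untruncated) Poisson rate, about 2.44
```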
A hanging rootogram is a diagnostic plot for comparing empirical counts with a theoretical distribution: the observed bars, on a square-root scale, hang from the fitted curve, so any bar that falls short of (or punches through) the baseline signals a misfit at that count:
In [6]:
%%R
library(vcd)
rootogram(numbers$Frequency,dpospois(1:15,exp(0.89364))*998)
Not good at all. As an alternative, we can try the Zero-Truncated Negative Binomial distribution (the negative binomial is a generalization of the Poisson with an additional parameter to allow for overdispersion):
In [7]:
%%R
m2 <- vglm(Outcomes~1,family="posnegbinomial",weights=Frequency,data=numbers)
summary(m2)
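The ZTNB pmf can be built from base R's `dnbinom()` the same way we truncated the Poisson, by renormalizing over k ≥ 1; this is what VGAM's `dposnegbin()` computes. A sketch, plugging in the parameter estimates used below:

```r
# Zero-truncated negative binomial pmf: renormalize dnbinom() (in its
# size/mu parameterization) to exclude zero. Equivalent to VGAM's
# dposnegbin(x, size, munb).
dztnb <- function(k, size, munb) {
  dnbinom(k, size = size, mu = munb) / (1 - dnbinom(0, size = size, mu = munb))
}

sum(dztnb(1:1000, size = exp(-1.82625), munb = exp(-0.53464)))  # ~1
```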
In [8]:
%%R
rootogram(numbers$Frequency,dposnegbin(1:15,size=exp(-1.82625),munb=exp(-0.53464))*998)
Much better! This distribution is theoretically justifiable and is a good fit to the observed counts, so instead of re-sampling training data from the corpus counts, we can generate (hypothetical, simulated) training data by generating random ZTNB variates. We can also see what the probability of getting a set with more than 15 elements in it is:
In [9]:
%%R
1-pposnegbin(15,munb=exp(-0.53481),size=exp(-1.82653))
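With the fitted parameters in hand, generating simulated training data is straightforward. VGAM provides `rposnegbin()` for this; a base-R rejection sketch makes the logic explicit (note that with these parameters the untruncated negative binomial puts a lot of mass on zero, so many draws get discarded):

```r
# Generate zero-truncated negative binomial variates by rejection:
# draw ordinary NB variates and keep only the nonzero ones.
rztnb <- function(n, size, munb) {
  out <- integer(0)
  while (length(out) < n) {
    x <- rnbinom(5 * n, size = size, mu = munb)
    out <- c(out, x[x > 0])
  }
  out[1:n]
}

set.seed(1)
train <- rztnb(1000, size = exp(-1.82625), munb = exp(-0.53464))
table(train)  # simulated cardinalities, all >= 1, with a long right tail
```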
The choice of distribution for modeling the inputs seems solid. But what about the parameters? We've been using the data from the ndl package in R, which is allegedly from an unpublished paper by Ramscar et al. The numbers they report in the PNAS paper are slightly different. How well do those fit?
First we'll check the Spanish data from Table 1:
In [16]:
%%R
num <- 1:7
spanish <- c(1079,928,469,249,234,155,86)
ms <- vglm(num~1,weights=spanish,family=posnegbinomial())
summary(ms)
In [18]:
%R rootogram(spanish,dposnegbin(1:7,size=exp(1.10129),munb=exp(0.67067))*sum(spanish))
Not too bad. Ramscar et al. describe this as an 'inverse power relation'. If that were really what's going on, we'd expect to see a straight line on a rank vs. frequency plot with logarithmic axes, which we don't:
In [13]:
%R plot(1:7,spanish,log='xy',type='l')
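To quantify the departure: if an inverse power law held, log frequency would be linear in log rank, and the least-squares residuals on that scale would show no systematic pattern. A quick sketch using the Table 1 counts above:

```r
# Fit a straight line to log(frequency) vs. log(rank); the slope
# estimates the putative power-law exponent, and any systematic sign
# pattern in the residuals indicates curvature, i.e. a non-power-law.
num <- 1:7
spanish <- c(1079, 928, 469, 249, 234, 155, 86)
fit <- lm(log(spanish) ~ log(num))
coef(fit)    # intercept and (negative) slope
resid(fit)   # look for runs of same-signed residuals
```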
Now let's check the English counts:
In [15]:
%%R
english <- c(856,707,368,224,204,156,87)
me <- vglm(num~1,weights=english,family=posnegbinomial())
summary(me)
In [17]:
%R rootogram(english,dposnegbin(1:7,size=exp(1.02552),munb=exp(0.72634))*sum(english))
Again, not too bad. But not perfect, either. It's suspicious that the model overestimates the frequencies for 3 and 4 and underestimates 5, 6, and 7 in both Spanish and English. I have a feeling the model isn't such a perfect fit after all (once again, reality trumps statistics!). But it's still pretty good, with very similar results for the two languages. And, very different results (at least in terms of the parameter estimates) from the ndl dataset we've been using.