In [2]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [3]:
import nltk
from tethne.readers import zotero
import matplotlib.pyplot as plt
from helpers import normalize_token, filter_token
import numpy as np
import pandas as pd

1.2.2. A Bayesian approach: the linear model

In the last few notebooks, we used a corpus that was generated from a Zotero collection, reading our metadata in from a Zotero RDF/XML document. Sometimes we either don't have access to additional metadata, or simply don't need it for our analysis. A very simple way to encode metadata is to include it in the file name. This isn't a very robust strategy, but for certain use cases it does the job.


In [4]:
text_root = '../data/SystemsBiology'
documents = nltk.corpus.PlaintextCorpusReader(text_root, r'.+\.txt')

In this corpus, the publication year is encoded as the first four characters of the filename. For example:


In [5]:
documents.fileids()[0]


Out[5]:
'2003_1.txt'

So when we generate our conditional distributions, we can simply convert those first four characters to an integer.


In [6]:
int(documents.fileids()[0][:4])


Out[6]:
2003

Review: OLS regression + permutation test

In the code-cells below, we repeat the procedure from the previous notebook to fit a linear regression model, and then perform a permutation test to assess significance.


In [7]:
word_counts_over_time = nltk.ConditionalFreqDist([
        (int(fileid[:4]), normalize_token(token))
         for fileid in documents.fileids()
         for token in documents.words(fileids=[fileid])
         if filter_token(token)
    ])



In [8]:
word_probs_over_time = nltk.ConditionalProbDist(word_counts_over_time, nltk.MLEProbDist)



In [9]:
organism_data = pd.DataFrame(columns=['Year', 'Probability'])
for i, (year, counts) in enumerate(word_probs_over_time.items()):
    organism_data.loc[i] = [year, counts.prob('organism')]

In [10]:
organism_data


Out[10]:
    Year  Probability
0   2003     0.002268
1   2004     0.003743
2   2005     0.002191
3   2006     0.001522
4   2007     0.000906
5   2008     0.001088
6   2009     0.000944
7   2010     0.001748
8   2011     0.002160
9   2012     0.000882
10  2013     0.000883

In [11]:
plt.scatter(organism_data.Year, organism_data.Probability)
plt.ylim(organism_data.Probability.min() - 0.001, 
         organism_data.Probability.max() + 0.001)
plt.ylabel('$\\hat{p}(\'organism\'|year)$')
plt.xlabel('Year')
plt.show()



In [12]:
from scipy.stats import linregress
Beta, Beta0, r, p, stde = linregress(organism_data.Year, organism_data.Probability)

print '^Beta:', Beta
print '^Beta_0:', Beta0
print 'r-squared:', r*r
print 'p:', p

plt.scatter(organism_data.Year, organism_data.Probability)
plt.plot(organism_data.Year, Beta0 + Beta*organism_data.Year, lw=2)
plt.ylim(organism_data.Probability.min() - 0.001, 
         organism_data.Probability.max() + 0.001)
plt.ylabel('$\\hat{p}(\'organism\'|year)$')
plt.show()


^Beta: -0.000163374366375
^Beta_0: 0.329722373929
r-squared: 0.37234130934
p: 0.0461889764899

In [13]:
samples = pd.DataFrame(columns=['Beta_pi', 'Beta0_pi'])
for i in xrange(1000):
    shuffled_probability = np.random.permutation(organism_data.Probability)
    Beta_pi, Beta0_pi, _, _, _ = linregress(organism_data.Year, shuffled_probability)
    samples.loc[i] = [Beta_pi, Beta0_pi]
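
Before inspecting the histograms, we can compute a two-tailed empirical p-value directly from these permutations: the fraction of shuffled slopes at least as extreme as the observed one. (This check is an addition, not part of the original notebook.)


In [ ]:
# Two-tailed empirical p-value: the share of permuted slopes whose
#  magnitude is at least that of the slope fit to the unshuffled data.
p_empirical = np.mean(np.abs(samples.Beta_pi) >= abs(Beta))
print 'empirical p:', p_empirical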

In [14]:
plt.figure(figsize=(10, 5))

plt.subplot(121)
plt.hist(samples.Beta_pi)    # Histogram of Beta values from permutations.
plt.plot([Beta, Beta], [0, 200],     # Beta from the observed data.
         lw=5, label='$\\hat{\\beta}$')

# Plot the upper and lower bounds of the inner 95% probability.
Beta_upper = np.percentile(samples.Beta_pi, 97.5)
Beta_lower = np.percentile(samples.Beta_pi, 2.5)
plt.plot([Beta_upper, Beta_upper], [0, 200], color='k', lw=2, label='$p = 0.05$')
plt.plot([Beta_lower, Beta_lower], [0, 200], color='k', lw=2)

plt.legend()
plt.xlabel('$\\beta_{\\pi}$', fontsize=24)

# Same procedure for Beta0.
plt.subplot(122)
plt.hist(samples.Beta0_pi)
plt.plot([Beta0, Beta0], [0, 200], lw=5, label='$\\hat{\\beta_0}$')

Beta0_upper = np.percentile(samples.Beta0_pi, 97.5)
Beta0_lower = np.percentile(samples.Beta0_pi, 2.5)
plt.plot([Beta0_upper, Beta0_upper], [0, 200], color='k', lw=2, label='$p = 0.05$')
plt.plot([Beta0_lower, Beta0_lower], [0, 200], color='k', lw=2)

plt.legend()
plt.xlabel('$\\beta_{0\\pi}$', fontsize=24)
plt.tight_layout()
plt.show()




A Bayesian approach

Now we'll try to fit a similar linear model, this time in a Bayesian framework. One notable difference is that we will model raw word counts (rather than frequencies) using a Poisson distribution (rather than a normal distribution). But we'll still be fitting a simple linear model.
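
Before building the model, here's a quick illustration (not part of the original analysis) of why the Poisson suits count data: it produces only non-negative integers, and its mean and variance both equal the rate parameter $\lambda$.


In [ ]:
# Illustrative only: draw from a Poisson with an arbitrary rate of 2.0
#  and check that the sample mean and variance are both close to 2.0.
draws = np.random.poisson(lam=2.0, size=100000)
print 'mean:', draws.mean(), 'variance:', draws.var()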


In [16]:
import pymc

In [17]:
document_index = pd.DataFrame(columns=['Year', 'File'])
for i, fileid in enumerate(documents.fileids()):
    year = int(fileid[:4]) 
    document_index.loc[i] = [year, fileid]

Instead of relative frequencies, we'll use raw token counts. The Poisson distribution describes counts of events (here, tokens) occurring in windows of equal size. So we'll ignore document boundaries, and obtain counts over equally-sized chunks of each annual subcorpus.


In [18]:
organism_data = pd.DataFrame(columns=['Year', 'Count'])
i = 0
for year, indices in document_index.groupby('Year').groups.items():
    fileids = document_index.loc[indices].File
    tokens = [normalize_token(token)    # The tokens in this year.
              for token in documents.words(fileids=fileids)
              if filter_token(token)]

    N = len(tokens)        # Number of tokens in this year.
    chunk_size = 200       # Tokens per chunk. This shouldn't be too large,
                           #  or each year will yield only a few observations.
    for x in xrange(0, N, chunk_size):
        current_chunk = tokens[x:x+chunk_size]
        if len(current_chunk) < chunk_size:
            continue    # Drop the final partial chunk, so that all
                        #  observation windows are the same size.
        count = nltk.FreqDist(current_chunk)['organism']

        # Store the count for this chunk as an observation.
        organism_data.loc[i] = [year, count]
        i += 1    # Increment the index variable.

The histograms below summarize the observed count data for each year.


In [19]:
organism_data.hist(column='Count', by='Year', figsize=(7, 7))
plt.show()


It's helpful to view the data as a single scatterplot. I've lowered the opacity of the points, so that it's easier to see their relative density. The orange dots show the mean count for each year.


In [20]:
organism_data.plot(kind='scatter', x='Year', y='Count', 
                   alpha=0.4, label='Raw data')
means = organism_data.groupby('Year').mean()
plt.scatter(means.index, means.Count, 
            color='orange', s=50, label='Mean count')
plt.legend(loc='best')
plt.show()


Our linear model here is pretty similar to the original linear model, in that it takes the form $Y = \beta_0 + \beta X$. The difference is that the linear predictor now gives the rate parameter $\lambda$ of the Poisson distribution from which the count data are drawn. Instead of using raw years (2003, 2004, etc.), we'll use year-offsets (0, 1, 2, etc.).
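
Written out in full (filling in the step between the two models), the model we're about to specify is

$$\lambda_i = \beta_0 + \beta x_i, \qquad y_i \sim \mathrm{Poisson}(\lambda_i),$$

where $x_i$ is the year-offset of observation $i$, $y_i$ is the token count in a 200-token chunk, and $\beta_0$ and $\beta$ get Normal priors. Note that with this identity link $\lambda_i$ could in principle go negative, where the Poisson likelihood is undefined; such proposals have zero probability and are simply rejected during sampling.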


In [21]:
# Instead of using values in the 2000s, we'll use the
#  offset from the start. So 2003 is year 0, 2004 is
#  year 1, etc.
X = organism_data.Year - organism_data.Year.min()

# Intercept. pymc's Normal takes (mu, tau), where tau is the
#  precision (1/variance) rather than the standard deviation.
Beta0 = pymc.Normal('Beta0', 0., 20, value=1.)

# "Slope" / coefficient, with the same Normal prior.
Beta = pymc.Normal('Beta', 0., 20, value=1.)

word_count = pymc.Poisson('word_count',
                          mu=Beta0 + Beta*X,    # The linear model for the rate.
                          value=organism_data.Count,
                          observed=True)
model = pymc.Model({
    'Beta0': Beta0,
    'Beta': Beta,
    'word_count': word_count
})

M (below) is our MCMC sampler. We draw 100,000 samples, discard the first 10,000 as burn-in, and keep only every 50th sample thereafter (thinning) to reduce autocorrelation in the retained trace.


In [22]:
M = pymc.MCMC(model)

In [23]:
M.sample(iter=100000, burn=10000, thin=50)


 [-----------------100%-----------------] 100000 of 100000 complete in 33.6 sec
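
With iter=100000, burn=10000, and thin=50, the retained trace should contain $(100000 - 10000)/50 = 1800$ samples per parameter. A quick sanity check (an addition, not in the original notebook):


In [ ]:
# The trace keeps every 50th sample after discarding the first 10,000:
#  (100000 - 10000) / 50 = 1800 retained samples per parameter.
print len(M.Beta.trace()[:])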

In [24]:
pymc.Matplot.plot(M)


Plotting Beta
Plotting Beta0

In [25]:
organism_data.plot(kind='scatter', x='Year', y='Count', alpha=0.4, label='Raw data')
means = organism_data.groupby('Year').mean()
plt.scatter(means.index, means.Count, color='orange', s=50, label='Mean count')

years = organism_data.Year.unique()
years_offset = years - years.min()

# Posterior means of the intercept and slope, from the retained traces.
Beta0_mean = M.Beta0.trace()[:].mean()
Beta_mean = M.Beta.trace()[:].mean()
predicted_counts = Beta0_mean + Beta_mean*years_offset
plt.plot(years, predicted_counts, lw=2, label='Model')

plt.legend(loc='best')
plt.show()
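
The line above uses only the posterior means of $\beta_0$ and $\beta$. Since we have full traces for both parameters, we can also visualize uncertainty around the fitted rate. Here is a minimal sketch (assuming the M, years, and years_offset objects from the cell above) that shades a pointwise 95% credible band with plt.fill_between:


In [ ]:
# Predicted rate for every retained posterior sample:
#  one row per sample, one column per year.
lambda_samples = (M.Beta0.trace()[:][:, None]
                  + M.Beta.trace()[:][:, None] * years_offset[None, :])

# Pointwise 95% credible band for the rate.
lower = np.percentile(lambda_samples, 2.5, axis=0)
upper = np.percentile(lambda_samples, 97.5, axis=0)

plt.scatter(organism_data.Year, organism_data.Count, alpha=0.4, label='Raw data')
plt.fill_between(years, lower, upper, alpha=0.3, label='95% credible band')
plt.plot(years, Beta0_mean + Beta_mean*years_offset, lw=2, label='Posterior mean')
plt.legend(loc='best')
plt.show()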