The Log-Likelihood Ratio for Statistical Evaluation of Co-occurrence Counts

You've just heard something about the log-likelihood ratio. Now let's code it. Below is the equation.

$log\text{ }\lambda = log (\frac{L(H_1\textrm{) # independence}}{L(H_2\textrm{) # dependence}})$

$= log\text{ }L(c_{12}, c_1, p)\text{ }+\text{ }log\text{ }L(c_2-c_{12}, N-c_1, p)$
$\text{ }-\text{ }log\text{ }L(c_{12}, c_1, p_1)\text{ }-\text{ }log\text{ }L(c_2-c_{12}, N-c_1, p_2)$

Where $L(k,n,x) = x^k(1-x)^{n-k}$

So $log\text{ }L(c_{12}, c_1, p) = log(p^{c_{12}}(1-p)^{c_1-c_{12}})$

$c_1$ = occurrences of word 1 (the target word $t$
$c_2$ = occurrences of word 2 (the co-occurrent $c$
$c_{12}$ = co-occurrences of word 1 with word 2 ($n(c,t)$)
$N$ = number of tokens in the text

$p = c_2/N$
$p_1 = c_{12}/c_1$
$p_2 = (c_2-c_{12})/(N-c_1)$

Deriving the counts.

We run into a problem right away, however. The code above, which is taken directly from Manning & Schütze, is made to work with bigrams, i.e., words that occur together. For that purpose, we can simply use the $c_1$, $c_2$, and $N$ values from the text to calculate our answers.

But we have an 8-word window instead of a 1-word window, so we need to rethink how we derive our counts. Here, an article from Bullinaria and Levy comes to the rescue. We will talk about their equations to derive the counts in a second, but first we need to think about the change in thinking their method requires in us.

We need to stop thinking of our text as text, because it isn't any more. The data that we have are derived from the text, but they are not the text. So, if we want to do calculations on our data, we need to derive the counts from our data tables, not from our text.
And here's how:

$$c_1 = f(t) = \frac {1}{W} \sum_c n(c,t)$$

The $n(c,t)$ should look familiar to you. $W$ is the total window size. I.e., if you are calculating 4L-4R, $W = 8$. The equation means that the frequency of $t$ is the sum for every $c$ (co-occurrent) of the number of times that $t$ (target word) and $c$ co-occur divided by the total size of the co-occurrence window. In our cooc_dict in lesson 3b we calculated the number of times that every $t$ co-occurred with every $c$. And we know the total window size. So that means that the co-occurrence DataFrame that you derived and saved from the texts in lesson 3b has the data that we need to calculate $c_1$.

Quiz:

Write a function below that takes as input a co-occurrence DataFrame and returns a Series with the counts of every $c_1$ based on the equation above.

Note: The calculation itself should take only one line of code!


In [11]:
import pandas as pd
import numpy as np

def c1_counter(df):
    # insert your code here
    return df.sum(axis = 1)/8

# Test your code below. Look at your results to make sure they make sense!
test_df = pd.read_pickle('./Data/blake-songs.txt.cooc.pickle') # I love the way Pandas reads pickles!
c1 = c1_counter(test_df)
print(c1)


1     1
10    1
12    1
13    1
14    1
15    1
17    1
19    1
20    1
23    1
25    1
26    1
27    1
29    1
3     1
...
‘sweet     1
‘then      1
‘they      1
‘thou      1
‘turn      1
‘weep      3
‘well      1
‘what      1
‘where     1
‘wrath     1
’         33
’twas      1
’—         1
“come      1
”’         1
Length: 1388, dtype: float64

The calculation for the $c_2$ looks similar but is just slightly different.

$$c_2 = f(c) = \frac {1}{W} \sum_t n(c,t)$$

Quiz:

Try to interpret this equation yourself and then write a function, which will be slightly different from the c1_counter function, that takes as input a co-occurrence DataFrame and will calculate the counts for every $c_2$, returning a Pandas Series with these counts.


In [12]:
def c2_counter(df):
    # insert your code here
    return df.sum(axis = 0)/8

# Test your code below. Look at your results to make sure they make sense!
c2 = c2_counter(test_df)
print(c2)


1     1
10    1
12    1
13    1
14    1
15    1
17    1
19    1
20    1
23    1
25    1
26    1
27    1
29    1
3     1
...
‘sweet     1
‘then      1
‘they      1
‘thou      1
‘turn      1
‘weep      3
‘well      1
‘what      1
‘where     1
‘wrath     1
’         33
’twas      1
’—         1
“come      1
”’         1
Length: 1388, dtype: float64

While technically $c_1$ and $c_2$ should be computed differently because they rely on different data, for the data that we have...


In [13]:
len(test_df)


Out[13]:
1388

In [14]:
c1.all() == c2.all()


Out[14]:
True

Since our co-occurrence table is symmetrical, it doesn't matter whether we sum our data row-wise or column-wise, we get the same answer. If the above printed False instead of True, you have done something wrong somewhere. Check your code!

Since we will use the co-occurrence counts that we calculated in lesson 3b from our pickled DataFrames, the last count that we need to calculate is that for $N$. The formula for calculating $N$ is:

$$N = \frac {1}{W} \sum_t\sum_c n(c,t)$$

Quiz:

Consider what this means and implement a function below that takes a co-occurrence DataFrame and returns the float $N$.


In [15]:
def N_counter(df):
    # insert your code here
    return df.values.sum()/8

# code to test your function below.  Check your results.
N = N_counter(test_df)
print(N)


5636.5

Calculate the probabilities.

Quiz:

So now we have all the counts that we need, so it is time to calculate the probabilities $p$, $p_1$, and $p_2$. Define a function below that takes as input one row of a co-occurrence DataFrame (i.e., the $c_{12}$ values for a single word type), the count Series $c_1$ and $c_2$, and the scalar $N$ and returns the Series $p$, $p_1$, and $p_2$ containing the $p$ values for the target word.

Note: Vectorize!


In [16]:
print(c2.ix['the'])
print(c2.ix['10'])
test_df.ix['the', '10']


357.0
1.0
Out[16]:
2.0

In [17]:
def p_calc(c12, c1, c2, N):
    #insert your code here
    return c2/N, c12/c1, (c2-c12)/(N-c1)

# code to test your function below. Check your results
p, p1, p2 = p_calc(test_df.ix['the'], c1, c2, N)
for c in test_df.index[:10]:
    print('p for "the" and "%s" == %s' % (c, p.ix[c]))
    print('p1 for "the" and "%s" == %s' % (c, p1.ix[c]))
    print('p2 for "the" and "%s" == %s' % (c, p2.ix[c]))


p for "the" and "1" == 0.000177415062539
p1 for "the" and "1" == 2.0
p2 for "the" and "1" == -0.000177446544229
p for "the" and "10" == 0.000177415062539
p1 for "the" and "10" == 2.0
p2 for "the" and "10" == -0.000177446544229
p for "the" and "12" == 0.000177415062539
p1 for "the" and "12" == 2.0
p2 for "the" and "12" == -0.000177446544229
p for "the" and "13" == 0.000177415062539
p1 for "the" and "13" == 1.0
p2 for "the" and "13" == 0.0
p for "the" and "14" == 0.000177415062539
p1 for "the" and "14" == 0.0
p2 for "the" and "14" == 0.000177446544229
p for "the" and "15" == 0.000177415062539
p1 for "the" and "15" == 1.0
p2 for "the" and "15" == 0.0
p for "the" and "17" == 0.000177415062539
p1 for "the" and "17" == 1.0
p2 for "the" and "17" == 0.0
p for "the" and "19" == 0.000177415062539
p1 for "the" and "19" == 0.0
p2 for "the" and "19" == 0.000177446544229
p for "the" and "20" == 0.000177415062539
p1 for "the" and "20" == 0.0
p2 for "the" and "20" == 0.000177446544229
p for "the" and "23" == 0.000177415062539
p1 for "the" and "23" == 0.0
p2 for "the" and "23" == 0.000177446544229

Code the equation itself.

Now, we have all of the values that we need to actually implement the log-likelihood equation above. But we should write one more function before we jump into solving the equation. This function should deal with this sub-equation:

$L(k,n,x) = x^k(1-x)^{n-k}$

Quiz:

I bet you can figure out what to do below!


In [18]:
def L(k,n,x):
    #insert your code here
    return np.power(x,k)*np.power((1-x),(n-k))

And now, finally, we can code the equation itself.

$= log\text{ }L(c_{12}, c_1, p)\text{ }+\text{ }log\text{ }L(c_2-c_{12}, N-c_1, p)$
$\text{ }-\text{ }log\text{ }L(c_{12}, c_1, p_1)\text{ }-\text{ }log\text{ }L(c_2-c_{12}, N-c_1, p_2)$

Because we have broken the process down into several steps, this function should be just a short as all of the functions above. Maybe even shorter. The function below should return a Series with the log-likelihood values for every co-occurrent $c$ in relation to one target word $t$.

Note: Because we can reference global variables inside a function, we do not need to pass any arguments to our function except the target word $t$.


In [19]:
def log_likelihood(t, df):
    #insert your code here.
    e1 = np.log(L(df.ix[t], c1, p))
    e2 = np.log(L(c2 - test_df.ix[t], N-c1, p))
    e3 = np.log(L(test_df.ix[t], c1, p1))
    e4 = np.log(L(c2-test_df.ix[t], N-c1, p))
    return e1 + e2 - e3 - e4

# To test your code. Check your results.
LL_series = log_likelihood('the')
print(-2*LL_series)


1           NaN
10          NaN
12          NaN
13    17.274037
14     0.000355
15    17.274037
17    17.274037
19     0.000355
20     0.000355
23     0.000355
25     0.000355
26     0.000355
27     0.000355
29     0.000355
3           NaN
...
‘sweet      0.000355
‘then       0.000355
‘they       0.000355
‘thou       0.000355
‘turn      17.274037
‘weep      26.335605
‘well       0.000355
‘what      17.274037
‘where      0.000355
‘wrath     17.274037
’         139.759944
’twas       0.000355
’—          0.000355
“come      17.274037
”’          0.000355
Length: 1388, dtype: float64

Dealing with NaN values.

You may have noticed that some of the values in our LL_series object are NaN values. That means that, for whatever reason, Python was unable to compute them. Take a look at the data you are using and the results you are getting and try to figure out why this is happening. Take a moment to think about this problem and how you think you might be able to solve it.

Discuss, discuss, discuss, discuss.

On the basis of the discussions, edit your log_likelihood function below to deal with the NaN problem.

Notes: 64-bit float: $10^{-323}$, 128-bit float: $10^{-4950}$, Decimal float: $10^{-999,999,999,999,999,999}$


In [20]:
for prob in [p, p1, p2]:
    prob[prob >= 1] = .99
    print(prob[prob>=1])


Series([], dtype: float64)
Series([], dtype: float64)
Series([], dtype: float64)

In [21]:
def log_L(k,n,x):
    #insert your code here
    return (np.log(x)*k)+(np.log(1-x)*(n-k))

In [28]:
def log_likelihood(t, df):
    #insert your code here.
    c12 = df.ix[t]
    e1 = log_L(c12, c1, p).replace([np.inf, -np.inf, np.nan], 0)
    e2 = log_L(c2 - c12, N-c1, p).replace([np.inf, -np.inf, np.nan], 0)
    e3 = log_L(c12, c1, p1).replace([np.inf, -np.inf, np.nan], 0)
    e4 = log_L(c2-c12, N-c1, p2).replace([np.inf, -np.inf, np.nan], 0)
    return -2*(e1 + e2 - e3 - e4)

# To test your code. Check your results.
LL_series = log_likelihood('the', test_df)
print(LL_series)


1     28.443999
10    28.443999
12    28.443999
13    19.253759
14     0.000355
15    19.253759
17    19.253759
19     0.000355
20     0.000355
23     0.000355
25     0.000355
26     0.000355
27     0.000355
29     0.000355
3     28.443999
...
‘sweet      0.000355
‘then       0.000355
‘they       0.000355
‘thou       0.000355
‘turn      19.253759
‘weep      28.136961
‘well       0.000355
‘what      19.253759
‘where      0.000355
‘wrath     19.253759
’         151.952830
’twas       0.000355
’—          0.000355
“come      19.253759
”’          0.000355
Length: 1388, dtype: float64

Putting it all together!

This last piece of script should be quite easy now that you have coded all of the functions above. The last step in the process is to write a script that will go through all the target words $t$ in your corpus, calculate the log-likelihood vectors for all of them, and then put them together into a new DataFrame that has the same shape as your co-occurrence DataFrame but is, instead, filled with log-likelihood values. I will get you started.


In [32]:
#LL_df = pd.DataFrame(index = test_df.index, columns = test_df.columns)
# initializing an empty DataFrame like this will reserve the necessary memory to build it
# But you shouldn't initialize until you are ready to build it, as we are here.
def create_LL_df(df):
    ll_df = pd.DataFrame(index = df.index, columns = df.columns)
    for t in df.index:
        p, p1, p2 = p_calc(df.ix[t], c1, c2, N)
        ll_df.ix[t] = log_likelihood(t, df)
    return ll_df

In [33]:
LL_df = create_LL_df(test_df)

In [29]:
print(LL_df)


None

Final Activity

You have everything you need now to calculate the log-likelihood ratio for the co-occurrence DataFrames that we pickled in lesson 3b. So write a script below that loads each of these DataFrames, runs it through the gamut of functions above, produces a full log-likelihood DataFrame, and then pickles this DataFrame as an appropriately named .pickle file in the directory of your choosing.

Note: If you have cloned the original repository on Github, don't try to push after having produced all of these data files. I expect that some of them could be several hundred MB or even several GB.

Note2: If you are getting memory errors, just find the smallest co-occurrence DataFrame in your collection (if you are using the provided data, then probably blake-songs or blake-poems) and run the LL functions on that DF.

Note3: Even if you have vectorized everything perfectly, it will probably take several minutes for your code to run (maybe even much longer). So go grab a coffee while it is running.


In [ ]:
# Insert your code here.