You've just heard something about the log-likelihood ratio. Now let's code it. Below is the equation.
$$\log \lambda = \log \frac{L(H_1)}{L(H_2)} \qquad (H_1\text{: independence},\ H_2\text{: dependence})$$
$$= \log L(c_{12}, c_1, p) + \log L(c_2 - c_{12}, N - c_1, p) - \log L(c_{12}, c_1, p_1) - \log L(c_2 - c_{12}, N - c_1, p_2)$$
Where $L(k,n,x) = x^k(1-x)^{n-k}$
So $\log L(c_{12}, c_1, p) = \log\left(p^{c_{12}}(1-p)^{c_1-c_{12}}\right)$
$c_1$ = occurrences of word 1 (the target word $t$)
$c_2$ = occurrences of word 2 (the co-occurrent $c$)
$c_{12}$ = co-occurrences of word 1 with word 2 ($n(c,t)$)
$N$ = number of tokens in the text
$p = c_2/N$
$p_1 = c_{12}/c_1$
$p_2 = (c_2-c_{12})/(N-c_1)$
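To make these definitions concrete before we touch any real data, here is a quick worked example with invented toy counts. The numbers are purely illustrative, and these names will be overwritten by the real values we compute later in the lesson.
In [ ]:
import numpy as np

# Hypothetical toy counts, invented purely for illustration.
c1, c2, c12, N = 100, 50, 10, 10000   # occurrences of t, of c, of both together, and total tokens

p  = c2 / N                  # 0.005: probability of seeing c anywhere in the text
p1 = c12 / c1                # 0.1:   probability of seeing c inside t's window
p2 = (c2 - c12) / (N - c1)   # ~0.004: probability of seeing c outside t's window

def L(k, n, x):
    return x**k * (1 - x)**(n - k)

log_lambda = (np.log(L(c12, c1, p)) + np.log(L(c2 - c12, N - c1, p))
              - np.log(L(c12, c1, p1)) - np.log(L(c2 - c12, N - c1, p2)))
print(-2 * log_lambda)       # -2 log lambda is the statistic we will compute below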
We run into a problem right away, however. The equations above, which are taken directly from Manning & Schütze, are designed for bigrams, i.e., words that occur directly next to each other. For that purpose, we can simply use the $c_1$, $c_2$, and $N$ values from the text to calculate our answers.
But we have an 8-word window instead of a 1-word window, so we need to rethink how we derive our counts. Here, an article from Bullinaria and Levy comes to the rescue. We will talk about their equations for deriving the counts in a second, but first we need to consider the shift in perspective their method requires of us.
We need to stop thinking of our text as text, because it isn't any more. The data that we have are derived from the text, but they are not the text. So, if we want to do calculations on our data, we need to derive the counts from our data tables, not from our text.
And here's how:
$$c_1 = f(t) = \frac {1}{W} \sum_c n(c,t)$$
The $n(c,t)$ should look familiar to you. $W$ is the total window size. I.e., if you are calculating 4L-4R, $W = 8$. The equation means that the frequency of $t$ is the sum for every $c$ (co-occurrent) of the number of times that $t$ (target word) and $c$ co-occur divided by the total size of the co-occurrence window. In our cooc_dict in lesson 3b we calculated the number of times that every $t$ co-occurred with every $c$. And we know the total window size. So that means that the co-occurrence DataFrame that you derived and saved from the texts in lesson 3b has the data that we need to calculate $c_1$.
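If the sum in that equation still feels abstract, here is a tiny, completely made-up co-occurrence table (the words and counts are invented; with a 4L-4R window, $W = 8$) and the $c_1$ value for one word worked out by hand:
In [ ]:
import pandas as pd

# A hypothetical toy co-occurrence table: rows are targets t, columns are co-occurrents c,
# and each cell holds n(c, t). The words and counts are invented for illustration only.
toy = pd.DataFrame([[0, 6, 2],
                    [6, 0, 4],
                    [2, 4, 0]],
                   index=['lamb', 'tyger', 'rose'],
                   columns=['lamb', 'tyger', 'rose'])
W = 8   # a 4L-4R window
# c1 for 'lamb' = (1/W) * sum over every c of n(c, 'lamb') = (0 + 6 + 2) / 8
print(toy.loc['lamb'].sum() / W)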
Quiz:
Write a function below that takes as input a co-occurrence DataFrame and returns a Series with the counts of every $c_1$ based on the equation above.
Note: The calculation itself should take only one line of code!
In [11]:
import pandas as pd
import numpy as np

def c1_counter(df):
    # insert your code here
    # Sum the co-occurrence counts in each row (one row per target word t)
    # and divide by the window size W = 8.
    return df.sum(axis=1) / 8
# Test your code below. Look at your results to make sure they make sense!
test_df = pd.read_pickle('./Data/blake-songs.txt.cooc.pickle') # I love the way Pandas reads pickles!
c1 = c1_counter(test_df)
print(c1)
The calculation for the $c_2$ looks similar but is just slightly different.
$$c_2 = f(c) = \frac {1}{W} \sum_t n(c,t)$$
Quiz:
Try to interpret this equation yourself and then write a function, slightly different from the c1_counter function, that takes a co-occurrence DataFrame as input and calculates the counts for every $c_2$, returning a Pandas Series with these counts.
In [12]:
def c2_counter(df):
    # insert your code here
    # Sum the co-occurrence counts in each column (one column per co-occurrent c)
    # and divide by the window size W = 8.
    return df.sum(axis=0) / 8
# Test your code below. Look at your results to make sure they make sense!
c2 = c2_counter(test_df)
print(c2)
While technically $c_1$ and $c_2$ are computed differently ($c_1$ sums across the rows, $c_2$ across the columns), for the data that we have...
In [13]:
len(test_df)
Out[13]:
In [14]:
(c1 == c2).all()
Out[14]:
Since our co-occurrence table is symmetric, it doesn't matter whether we sum our data row-wise or column-wise; we get the same answer either way. If the above printed False instead of True, you have done something wrong somewhere. Check your code!
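If you want to check the symmetry claim on the table itself rather than on the derived sums, one line will do (a quick sanity check, not part of the exercise):
In [ ]:
# A co-occurrence table built with a symmetric window should equal its own transpose.
print(test_df.equals(test_df.T))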
Since we will use the co-occurrence counts that we calculated in lesson 3b from our pickled DataFrames, the last count that we need to calculate is that for $N$. The formula for calculating $N$ is:
$$N = \frac {1}{W} \sum_t\sum_c n(c,t)$$
Quiz:
Consider what this means and implement a function below that takes a co-occurrence DataFrame and returns the float $N$.
In [15]:
def N_counter(df):
    # insert your code here
    # Sum every cell in the co-occurrence DataFrame and divide by the window size W = 8.
    return df.values.sum() / 8
# code to test your function below. Check your results.
N = N_counter(test_df)
print(N)
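A quick sanity check that ties the three formulas together: summing $c_1$ over every target (or $c_2$ over every co-occurrent) reproduces the double sum in the formula for $N$, so both should give back $N$:
In [ ]:
# N is the double sum over the whole table divided by W, so it should also equal
# the sum of either count Series (up to floating-point noise).
print(np.isclose(N, c1.sum()), np.isclose(N, c2.sum()))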
Quiz:
Now we have all of the counts that we need, so it is time to calculate the probabilities $p$, $p_1$, and $p_2$. Define a function below that takes as input one row of a co-occurrence DataFrame (i.e., the $c_{12}$ values for a single target word), the count Series $c_1$ and $c_2$, and the scalar $N$, and returns three Series containing the $p$, $p_1$, and $p_2$ values for that target word.
Note: Vectorize!
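If "vectorize" is a new word for you: Pandas Series support element-wise arithmetic aligned on the index, so all of the probabilities for a target word can be computed at once, with no explicit loop. A tiny illustration with made-up numbers:
In [ ]:
# Two toy Series with a shared index (the values are invented for illustration).
a = pd.Series([2, 4, 6], index=['lamb', 'tyger', 'rose'])
b = pd.Series([4, 8, 8], index=['lamb', 'tyger', 'rose'])
print(a / b)    # element-wise division, aligned on the index
print(a / 10)   # a Series divided by a scalar, also element-wise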
In [16]:
print(c2.loc['the'])
print(c2.loc['10'])
test_df.loc['the', '10']
Out[16]:
In [17]:
def p_calc(c12, c1, c2, N):
    # insert your code here
    # p: probability of c anywhere; p1: probability of c inside t's window;
    # p2: probability of c outside t's window.
    return c2/N, c12/c1, (c2 - c12)/(N - c1)
# code to test your function below. Check your results
p, p1, p2 = p_calc(test_df.loc['the'], c1, c2, N)
for c in test_df.index[:10]:
    print('p for "the" and "%s" == %s' % (c, p.loc[c]))
    print('p1 for "the" and "%s" == %s' % (c, p1.loc[c]))
    print('p2 for "the" and "%s" == %s' % (c, p2.loc[c]))
Now, we have all of the values that we need to actually implement the log-likelihood equation above. But we should write one more function before we jump into solving the equation. This function should deal with this sub-equation:
$L(k,n,x) = x^k(1-x)^{n-k}$
Quiz:
I bet you can figure out what to do below!
In [18]:
def L(k, n, x):
    #insert your code here
    # The likelihood term x**k * (1 - x)**(n - k).
    return np.power(x, k) * np.power(1 - x, n - k)
And now, finally, we can code the equation itself.
$$\log \lambda = \log L(c_{12}, c_1, p) + \log L(c_2 - c_{12}, N - c_1, p) - \log L(c_{12}, c_1, p_1) - \log L(c_2 - c_{12}, N - c_1, p_2)$$
Because we have broken the process down into several steps, this function should be just as short as all of the functions above. Maybe even shorter. The function below should return a Series with the log-likelihood values for every co-occurrent $c$ in relation to one target word $t$.
Note: Because we can reference global variables (here, c1, c2, N, p, p1, and p2) inside a function, the only arguments we need to pass to our function are the target word $t$ and the co-occurrence DataFrame.
In [19]:
def log_likelihood(t, df):
    #insert your code here.
    c12 = df.loc[t]
    e1 = np.log(L(c12, c1, p))
    e2 = np.log(L(c2 - c12, N - c1, p))
    e3 = np.log(L(c12, c1, p1))
    e4 = np.log(L(c2 - c12, N - c1, p2))
    return e1 + e2 - e3 - e4
# To test your code. Check your results.
LL_series = log_likelihood('the', test_df)
print(-2*LL_series)
You may have noticed that some of the values in our LL_series object are NaN values. That means that, for whatever reason, Python was unable to compute them. Take a look at the data you are using and the results you are getting and try to figure out why this is happening. Take a moment to think about this problem and how you think you might be able to solve it.
On the basis of this discussion, edit your log_likelihood function below to deal with the NaN problem.
Note: the smallest positive values these types can hold are roughly $10^{-323}$ for a 64-bit float, $10^{-4950}$ for a 128-bit float, and $10^{-999{,}999{,}999{,}999{,}999{,}999}$ for Python's Decimal.
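To see where the 64-bit limit quoted above actually bites, NumPy and plain Python floats will show you directly (a quick illustration, not part of the exercise):
In [ ]:
print(np.finfo(np.float64).tiny)   # smallest *normal* positive 64-bit float, ~2.2e-308
print(1e-323)                      # still representable (as a subnormal value)
print(1e-324)                      # underflows to exactly 0.0 (and log(0) is -inf)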
In [20]:
# Clip any probabilities that have reached 1 so that log(1 - x) below cannot blow up.
for prob in [p, p1, p2]:
    prob[prob >= 1] = .99
    print(prob[prob >= 1])  # should now print an empty Series
In [21]:
def log_L(k, n, x):
    #insert your code here
    # Work in log space from the start: log(x)*k + log(1-x)*(n-k),
    # instead of taking the log of a product that may underflow to 0.
    return (np.log(x) * k) + (np.log(1 - x) * (n - k))
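To convince yourself that working in log space really fixes the underflow, compare the two approaches on some deliberately extreme, made-up values:
In [ ]:
# Made-up extreme values: x**k underflows to 0.0, so np.log(L(...)) becomes -inf
# (NumPy will also warn about log(0)), while summing the logs stays finite.
k, n, x = 500, 10000, 0.001
print(np.log(L(k, n, x)))   # -inf
print(log_L(k, n, x))       # a finite, very negative number (about -3463)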
In [28]:
def log_likelihood(t, df):
    #insert your code here.
    c12 = df.loc[t]
    # Replace any inf/-inf/NaN terms (e.g., from log(0)) with 0 so they don't poison the sum.
    e1 = log_L(c12, c1, p).replace([np.inf, -np.inf, np.nan], 0)
    e2 = log_L(c2 - c12, N - c1, p).replace([np.inf, -np.inf, np.nan], 0)
    e3 = log_L(c12, c1, p1).replace([np.inf, -np.inf, np.nan], 0)
    e4 = log_L(c2 - c12, N - c1, p2).replace([np.inf, -np.inf, np.nan], 0)
    return -2 * (e1 + e2 - e3 - e4)
# To test your code. Check your results.
LL_series = log_likelihood('the', test_df)
print(LL_series)
In [32]:
#LL_df = pd.DataFrame(index = test_df.index, columns = test_df.columns)
# Initializing an empty DataFrame like this reserves the necessary memory to build it.
# But you shouldn't initialize it until you are ready to fill it, as we do here.
def create_LL_df(df):
    # log_likelihood reads p, p1, and p2 as globals, so rebind the globals for every row.
    global p, p1, p2
    ll_df = pd.DataFrame(index=df.index, columns=df.columns)
    for t in df.index:
        p, p1, p2 = p_calc(df.loc[t], c1, c2, N)
        ll_df.loc[t] = log_likelihood(t, df)
    return ll_df
In [33]:
LL_df = create_LL_df(test_df)
In [29]:
print(LL_df)
You have everything you need now to calculate the log-likelihood ratio for the co-occurrence DataFrames that we pickled in lesson 3b. So write a script below that loads each of these DataFrames, runs it through the gamut of functions above, produces a full log-likelihood DataFrame, and then pickles this DataFrame as an appropriately named .pickle file in the directory of your choosing.
Note: If you have cloned the original repository on Github, don't try to push after having produced all of these data files. I expect that some of them could be several hundred MB or even several GB.
Note2: If you are getting memory errors, just find the smallest co-occurrence DataFrame in your collection (if you are using the provided data, then probably blake-songs or blake-poems) and run the LL functions on that DF.
Note3: Even if you have vectorized everything perfectly, it will probably take several minutes for your code to run (maybe even much longer). So go grab a coffee while it is running.
In [ ]:
# Insert your code here.
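If you get stuck, here is one possible shape for such a script. It is only a sketch: the input pattern, the output directory, and the file naming are assumptions about how you saved your files in lesson 3b, so adjust them to match your own setup.
In [ ]:
import glob
import os

out_dir = './Data/LL'                            # assumed output directory; change as needed
os.makedirs(out_dir, exist_ok=True)
for path in glob.glob('./Data/*.cooc.pickle'):   # assumed location and suffix of your pickled DataFrames
    df = pd.read_pickle(path)
    c1 = c1_counter(df)                          # rebind the globals that log_likelihood reads
    c2 = c2_counter(df)
    N = N_counter(df)
    LL_df = create_LL_df(df)
    name = os.path.basename(path).replace('.cooc.pickle', '.LL.pickle')
    LL_df.to_pickle(os.path.join(out_dir, name))
    print('Finished', name)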