"You shall know a word by the company it keeps!" These are the oft-quoted words of the linguist J.R. Firth in describing the meaning and spirit of collocational analysis. Collocation is a linguistic term for co-occuring. While most words have the possibility of co-occuring with most other words at some point in the English language, when there is a significant statistical relationship between two regularly co-occuring words, we can refer to these as collocates. One of the first, and most cited examples of collocational analysis concerns the words strong
and powerful
. While both words mean arguably the same thing, it is statistically more common to see the word strong
co-occur with the word tea
. Native speakers of English can immediately recognize the familiarity of strong tea
as opposed to powerful tea
, even though the two phrases both make sense in their own way (see Halliday, 1966 for more of this discussion). Interestingly, the same associations do not occur with the phrases strong men
and powerful men
, although in these instances, both phrases take on slightly different meanings.
These examples highlight the belief of Firthian linguists that the meaning of a word is not confined to the word itself, but in the associations that words have with other co-occuring words. Statistically significant collocates need not be adjacent, just proximal. The patterns of the words in a text, rather than the individual words themselves, have complex, relational units of meaning that allow us to ask questions about the use of lanugage in specific discourses.
In this exercise we will determine the statistical significance of the words that most often co-occur with privacy
in an attempt to better understand the meaning of the word as it is used in the Hansard Corpus. We will count the actual frequency of the co-occurence, as well as use a number of different statistical tests of probability. These tests will be conducted first on one file from the corpus, then on the entire corpus itself.
This section will determine the statistically significant collocates that accompany the word privacy
in the file for 2015. Testing file-by-file allows us to track the diachronic (time-based) change and use of the words.
Again, we'll begin by calling on all the functions we will need. Remember that the first few sentences are calling on pre-installed Python modules, and anything with a def
at the beginning is a custom function built specifically for these exercises. The text in red describes the purpose of the function.
In [1]:
# This is where the modules are imported
import csv
import sys
import codecs
import nltk
import nltk.collocations
import collections
import statistics
from nltk.metrics.spearman import *
from nltk.collocations import *
from nltk.stem import WordNetLemmatizer
from os import listdir
from os.path import splitext
from os.path import basename
from tabulate import tabulate
# These functions iterate through the directory and create a list of filenames
def list_textfiles(directory):
"Return a list of filenames ending in '.txt'"
textfiles = []
for filename in listdir(directory):
if filename.endswith(".txt"):
textfiles.append(directory + "/" + filename)
return textfiles
def remove_ext(filename):
"Removes the file extension, such as .txt"
name, extension = splitext(filename)
return name
def remove_dir(filepath):
"Removes the path from the file name"
name = basename(filepath)
return name
def get_filename(filepath):
"Removes the path and file extension from the file name"
filename = remove_ext(filepath)
name = remove_dir(filename)
return name
# This function works on the contents of the files
def read_file(filename):
"Read the contents of FILENAME and return as a string."
infile = codecs.open(filename, 'r', 'utf-8')
contents = infile.read()
infile.close()
return contents
Collocational analysis is a frequency-based technique that uses word counts to determine significance. One of the problems with counting word frequencies, as we have seen in other sections, is that the most frequently occuring words in English are function words, like the
, of
, and and
. For this reason, it is neccessary to remove these words in order to obtain meaningful results. In text analysis, these high frequency words are compiled into lists called stopwords
. While standard stopword lists are provided by the NLTK
module, for the Hansard Corpus it was necessary to remove other kinds of words, like proper nouns (names and place names), and other words common to parliamentary proceedings (like Prime Minister, Speaker, etc.). These words, along with the standard stopwords, can be seen below.
Here, we use our read_file
function to read in a text file of custom stopwords, assigning the variable customStopwords
. We tokenize the list using the split
function and then create a variable called hansardStopwords
that incorporates the NLTK
stopword list, and adding the words from customStopwords
if they don't already occur in the NLTK
list.
In [2]:
stopwords = read_file('../HansardStopwords.txt')
customStopwords = stopwords.split()
#default stopwords with custom words added
hansardStopwords = nltk.corpus.stopwords.words('english')
hansardStopwords += customStopwords
In [3]:
print(hansardStopwords)
Now, we use read_file
to load the contents of the file for 2015. For consistency and to avoid file duplication, we're always reading the files from the same directory. Even though it was used for other sections, the data is the same. We read the contents of the text file, then remove the case and punctuation from the text, split the words into a list of tokens, and assign the words in each file to a list with the variable name text
. What's new here, compared to other sections, is the additional removal of stopwords.
In [3]:
file = '../Counting Word Frequencies/data/2009.txt'
name = get_filename(file)
In [4]:
# opens, reads, and tokenizes the file
text = read_file(file)
words = text.split()
clean = [w.lower() for w in words if w.isalpha()]
# removes stopwords
text = [w for w in clean if w not in hansardStopwords]
Another type of processing required for the generation of accurate collocational statistics is called lemmatization. In linguistics, a lemma is the grammatical base or stem of a word. For example, the word protect
is the lemma of the verbs protecting
and protected
, while ethic
is the lemma of the noun ethics
. When we lemmatize a text, we are removing the grammatical inflections of the word forms (like ing
or ed
). The purpose of lemmatization for the Hansard Corpus is to obtain more accurate statistics for collocation by avoiding multiple entries for similar but different word forms (like protecting, protected, and protect). For the purpose of this text analysis, I have decided to lemmatize only the nouns and verbs in the Hansard Corpus, as the word privacy
is not easily modified by adjectives (or at all by adverbs).
The lemmatizer I have used for this project was developed by Princeton and is called WordNet. Lemmas and their grammatical inflections can be searched using their web interface.
In the code below, I load the WordNetLemmatizer (another function included in the NLTK
module) into the variable wnl
. Then, I iterate through the text, first lematizing the verbs (shown as v
), then the nouns (shown as n
). Unfortunately, the WordNet function only takes one argument, so this code requires two pass-throughs of the text. I'm sure there is a more elegant way to construct this code, though I've not found it yet. This is another reason why I've decided only to lemmatize verbs and nouns, rather than including adjectives and adverbs.
In [5]:
# creates a variable for the lemmatizing function
wnl = WordNetLemmatizer()
# lemmatizes all of the verbs
lemm = []
for word in text:
lemm.append(wnl.lemmatize(word, 'v'))
# lemmatizes all of the nouns
lems = []
for word in lemm:
lems.append(wnl.lemmatize(word, 'n'))
In [50]:
print("Number of words:", len(lems))
We need to make sure that the lemmatizer did something. Since we've only lemmatized for nouns and verbs, we check that here against the unlemmatized corpus, where text
has not been lemmatized and lems
has. Below we see that noun ethics
appears 156 times in the text
variable and 0 times in the lems
variable. But the lemma for ethics
: ethic
, remains in the lems
variable for a frequency of 161 times. Similar values are repeated for the verb and variations of protect
.
In [7]:
print('NOUNS')
print('ethics:', text.count('ethics'))
print('ethics:', lems.count('ethics'))
print('ethic:', lems.count('ethic'))
print('\n')
print('VERBS')
print('protecting:', text.count('protecting'))
print('protecting:', lems.count('protecting'))
print('protected:', text.count('protected'))
print('protected:', lems.count('protected'))
print('protect:', lems.count('protect'))
Here we check that the lemmatizer hasn't been over-zealous by determining the frequency for privacy
before and after the lemmatizing function. The frequencies are the same, meaning we've not lost anything in the lemmatization.
In [8]:
print('privacy:', text.count('privacy'))
print('privacy:', lems.count('privacy'))
Let's clarify some of the words we will be using in the rest of this exercise:
ngram
= catch-all term for multiple word occurencesbigram
= word pairstrigram
= three-word phrasesAfter the stopwords have been removed and the nouns and verbs lemmatized, we are ready to determine statistics for co-occuring words, or collocates. Any collocational test requires four pieces of data: the length of the text in which the words appear, the number of times the words both seperately appear in the text, and the number of times the words occur together.
Before we focus our search on the word privacy
, we will determine the 10 most commonly occuring bigrams (based on frequency) in the 2015 Hansard Corpus.
In this code we assign the lems
variable to colText
by adding the nltk.Text
functionality. We can then use the NLTK
function collocations
to determine (in this case) the 10 most common bigrams. Changing the number in the brackets will change the number of results returned.
In [9]:
# prints the 10 most common bigrams
colText = nltk.Text(lems)
colText.collocations(10)
For reference, I ran an earlier test that shows the 10 most common bigrams without the stopwords removed. Duplicating this test only requires that stopwords not be removed as the text is being tokenized and cleaned. We can see that there is a clear difference in the types of results returned with and without stopwords applied. The list of words appearing above is much more interesting in terms of discourse analysis, when functional parliamentary phrases like Prime Minister and Parliamentary Secretary have been removed.
Here is a piece of code that shows how the ngram function works. It goes word by word through the text, pairing each word with the one that came before. That's why the last word in the first word pair becomes the first in the next word pair.
We assign our colText
variable to the colBigrams
variable by specifiying that we want to make a list of ngrams containing 2 words. We could obtain trigrams by changing the 2 in the first line of code to a 3. Then, in the second line of code, we display the first 5 results of the colBigrams
variable with :5. We could display the first 10 by changing the number in the square brackets to :10, or show the top 10 results again by removing the colon.
In [10]:
# creates a list of bigrams (ngrams of 2), printing the first 5
colBigrams = list(nltk.ngrams(colText, 2))
colBigrams[:5]
Out[10]:
Here we will check to make sure we've the bigram function has gone through and counted the entire text. Having one less ngram is correct because of the way in which the ngrams are generated word-by-word in the test above.
In [11]:
print("Number of words:", len(lems))
print("Number of bigrams:", len(colBigrams))
In this section we will focus our search on bigrams that contain the word privacy
. First, we'll load the bigram tests from the NLTK
module, then, we will create a filter that only searches for bigrams containing privacy
. To search for bigrams containing other words, the word privacy
in the second line of code can be changed to something else.
In [12]:
# loads bigram code from NLTK
bigram_measures = nltk.collocations.BigramAssocMeasures()
# ngrams with 'privacy' as a member
privacy_filter = lambda *w: 'privacy' not in w
Next, we will load our lemmatized corpus into the bigram collocation finder, apply a frequency filter that only considers bigrams that appear four or more times, and then apply our privacy filter to the results. The variable finder
now contains a list of all the bigrams containing privacy
that occur four or more times.
In [34]:
# bigrams
finder = BigramCollocationFinder.from_words(lems, window_size = 2)
# only bigrams that appear 4+ times
finder.apply_freq_filter(4)
# only bigrams that contain 'privacy'
finder.apply_ngram_filter(privacy_filter)
Before I describe the statistical tests that we will use to determine the collocates for privacy
, it is important to briefly discuss distribution. The chart below maps the distribution of the top 25 terms in the 2015 file.
This is important because some of the tests assume a normal distribution of words in the text. A normal distribution means that the majority of the words occur a majority of the time; it is represented in statistics as a bell curve. This means that 68% of the words would occur within one standard deviation of the mean (or average frequency of each word in the text), 95% within two standard deviations, and 99.7 within three standard deviations.
What this means, is that tests that assume a normal distribution will work, but have inaccurate statistics to back them up. I've chosen to describe all of the collocational tests here as a matter of instruction and description, but it's important to understand the tests and what they assume before making research claims based on their results.
The code below calls on the NLTK
function FreqDist
. The function calculates the frequency of all the words in the variable and charts them in order from highest to lowest. Here I've only requested the first 25, though more or less can be displayed by changing the number in the brackets. Additionally, in order to have the chart displayed inline (and not as a popup), I've called the magic function matplotlib inline
. iPython magic functions are identifable by the % symbol.
In [14]:
%matplotlib inline
fd = nltk.FreqDist(colText)
fd.plot(25)
As we can see from the chart above, work
is the highest frequency word in our lemmatized corpus with stopwords applied, followed by right
. The word privacy
does not even occur in the list. The code below calculates the frequency and percentage of times these words occur in the text. While work
makes up 0.56% of the total words in the text, privacy
accounts for only 0.06%.
In [15]:
print('privacy:',fd['privacy'], 'times or','{:.2%}'.format(float(colText.count("privacy"))/(len(colText))))
print('right:',fd['right'], 'times or','{:.2%}'.format(float(colText.count("right"))/(len(colText))))
print('work:',fd['work'], 'times or','{:.2%}'.format(float(colText.count("work")/(len(colText)))))
To calculate the mean, and standard deviation, we must count the frequency of all the words in the text and append those values to a list. Since the numbers in the list will actually be represented as text (not as integers), we must add an extra line of code to map those values so they can be used mathematically, calling on the map
function.
In [16]:
fdnums = []
for sample in fd:
fdnums.append(fd[sample])
numlist = list(map(int, fdnums))
In [17]:
print("Total of unique words:", len(numlist))
print("Total of words that appear only once:", len(fd.hapaxes()))
print("Percentage of words that appear only once:",'{:.2%}'.format(len(fd.hapaxes())/len(numlist)))
Once we have our numbers in a list, as the variable numlist
, we can use the built in statistics
library for our calculations. Below we've calculated the mean, standard deviation, and the variance.
These numbers prove that the numerical data has a non-normal distribution. The mean is relatively low, compared to the highest frequency word, work
, which appears a total of 7588 times.
The low mean is due to the high number of low frequency words; there are 5847 words that appear only once, totalling 30% of the unique words in the entire set. The standard deviation is higher than the mean, which predicts a high variance of numbers in the set, something that is proven by the variance calculation. A large variance shows that the numbers in the set are far apart from the mean, and each other.
In [18]:
datamean = statistics.mean(numlist)
print("Mean:", '{:.2f}'.format(statistics.mean(numlist)))
print("Standard Deviation:", '{:.2f}'.format(statistics.pstdev(numlist,datamean)))
print("Variance:", '{:.2f}'.format(statistics.pvariance(numlist,datamean)))
The frequency calculations determine both the actual number of occurences of the bigram in the corpus as well as the number of times the bigram occurs relative to the text as a whole (expressed as a percentage).
The Student's T-Score, also called the T-Score, measures the confidence of a claim of collocation and assigns a score based on that certainty. It is computed by subtracting the expected frequency of the bigram by the observed frequency of the bigram, and then dividing the result by the standard deviation which is calculated based on the overall size of the corpus.
The benefit of using the T-Score is that it considers the evidence for collocates based on the overall amount of evidence provided by the size of the corpus. This differs from the PMI score (described below) which only considers strength based on relative frequencies. The drawbacks to the T-Score include its reliance on a normal distribution (due to the incorporation of standard deviation in the calculation), as well as its dependence on the overall size of the corpus. T-scores can't be compared across corpora of different sizes.
The Pointwise Mutual Information Score (known as PMI or MI) measures the strength of a collocation and assigns it a score. It is a probability-based calculation that compares the number of actual bigrams to the expected number of bigrams based on the relative frequency counts of the words. The test compares the expected figure to the observed figure, converting the difference to a number indicating the strength of the collocation.
The benefit of using PMI is that the value of the score is not dependent on the overall size of the corpus, meaning that PMI scores can be compared across corpora of different sizes, unlike the T-score (described above). The drawback to the PMI is that it tends to give high scores to low frequency words when they occur most often in the proximity another word.
The Chi-square (or x2) measures the observed and expected frequencies of bigrams and assigns a score based on the amount of difference between the two using the standard deviation. The Chi-square is another test that relies on a normal distribution.
The Chi-square shares the benefit of the T-score in taking into account the overall size of the corpus. The drawback of the Chi-square is that it doesn't do well with sparse data. This means that low-frequency (but significant) bigrams may not be represented very well, unlike the scores assigned by the PMI.
The Log-likelihood ratio calculates the size and significance between the observed and expected frequencies of bigrams and assigns a score based on the result, taking into account the overall size of the corpus. The larger the difference between the observed and expected, the higher the score, and the more statistically significant the collocate is.
The Log-likelihood ratio is my preferred test for collocates because it does not rely on a normal distribution, and for this reason, it can account for sparse or low frequency bigrams (unlike the Chi-square). But unlike the PMI, it does not over-represent low frequency bigrams with inflated scores, as the test is only reporting how much more likely it is that the frequencies are different than they are the same. The drawback to the Log-likelihood ratio, much like the t-score, is that it cannot be used to compare scores across corpora.
The following code filters the results of the focused bigram search based on the statistical tests as described above, assigning the results to a new variable based on the test.
In [35]:
# filter results based on statistical test
# calulates the raw frequency as an actual number and percentage of total words
act = finder.ngram_fd.items()
raw = finder.score_ngrams(bigram_measures.raw_freq)
# student's - t score
tm = finder.score_ngrams(bigram_measures.student_t)
# pointwise mutual information score
pm = finder.score_ngrams(bigram_measures.pmi)
# chi-square score
ch = finder.score_ngrams(bigram_measures.chi_sq)
# log-likelihood ratio
log = finder.score_ngrams(bigram_measures.likelihood_ratio)
Below are the results for the Log-likelihood test. The bigrams are sorted in order of significance, and the order of the words in the word-pairs shows their placement in the text. This means that the most significant bigram in the Log-likelihood test contained the words digital privacy
, in that order. The word digital
appears later on in the list with a lower score when it occurs after the word privacy
. Scores above 3.8 are considered to be significant for the Log-likelihood test.
In [36]:
print(log)
Let's display this data as a table, and remove some of the extra decimal digits. Using the tabulate module, we call the variable log
, set the table heading names (displayed in red), and set the number of decimal digits to 3 (indicated by floatfmt=".3f"
), with the numbers aligned on the leftmost digit.
In [37]:
print(tabulate(log, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", numalign="left"))
Here we print the results of this table to a CSV file.
In [22]:
with open(name + 'CompleteLog.csv','w') as f:
w = csv.writer(f)
w.writerows(log)
While the table above is nice, it isn't formated exactly the way it could be, especially since we already know that privacy is one half of the bigram. I want to format the list so I can do some further processing in some spreadsheet software, including combining the scores of the bigrams (like digital privacy
and privacy digital
) so I can have one score for each word.
The code below sorts the lists generated by each test by the first word in the bigram, appending them to a dictionary called prefix_keys
, where each word is a key and the score is the value. Then, we sort the keys by the value with the highest score, and assign the new list to a new variable with the word privacy removed. This code must be repeated for each test.
For the purposes of this analysis, we will only output the two frequency tests and the Log-likelihood test.
In [38]:
##################################################################
################ sorts list of ACTUAL frequencies ################
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, a in act:
# first word
prefix_keys[key[0]].append((key[1], a))
# second word
prefix_keys[key[1]].append((key[0], a))
# sort keyed bigrams by strongest association.
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
# remove the word privacy and display the first 50 results
actkeys = prefix_keys['privacy'][:50]
##################################################################
#### sorts list of RAW (expressed as percentage) frequencies #####
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, r in raw:
# first word
prefix_keys[key[0]].append((key[1], r))
# second word
prefix_keys[key[1]].append((key[0], r))
# sort keyed bigrams by strongest association.
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
rawkeys = prefix_keys['privacy'][:50]
##################################################################
############### sorts list of log-likelihood scores ##############
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, l in log:
# first word
prefix_keys[key[0]].append((key[1], l))
# second word
prefix_keys[key[1]].append((key[0], l))
# sort bigrams by strongest association
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
logkeys = prefix_keys['privacy'][:50]
Let's take a look at the new list of scores for the Log-likelihood test, with the word privacy
removed. Nothing has changed here except the formatting.
In [41]:
from tabulate import tabulate
print(tabulate(logkeys, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", numalign="left"))
Again, just for reference, these are the 25 top Log-Likelhood scores for 2015 without the stopwords applied.
Here we will write the sorted results of the tests to a CSV file.
In [25]:
with open(name + 'collocate_Act.csv','w') as f:
w = csv.writer(f)
w.writerows(actkeys)
with open(name + 'collocate_Raw.csv','w') as f:
w = csv.writer(f)
w.writerows(rawkeys)
with open(name + 'collocate_Log.csv','w') as f:
w = csv.writer(f)
w.writerows(logkeys)
What is immediately apparent from the Log-likelihood scores is that there are distinct types of words that co-occur with the word privacy. The top 10 most frequently co-occuring words are digital, protect, ethic, access, right, protection, expectation, and information. Based on this list alone, we can deduce that privacy in the Hansard corpus is a serious topic; one that is concerned with ethics and rights, which are things commonly associated with the law. We can also see that privacy has both a digital and an informational aspect, which are things that have an expectation of both access and protection.
While it may seem obvious that these kinds of words would co-occur with privacy, we now have statistical evidence upon which to build our claim.
Here we repeat the above code, only instead of using one file, we will combine all of the files to obtain the scores for the entire corpus.
In [26]:
corpus = []
for filename in list_textfiles('../Counting Word Frequencies/data2'):
text_2 = read_file(filename)
words_2 = text_2.split()
clean_2 = [w.lower() for w in words_2 if w.isalpha()]
text_2 = [w for w in clean_2 if w not in hansardStopwords]
corpus.append(text_2)
In [27]:
lemm_2 = []
for doc in corpus:
for word in doc:
lemm_2.append(wnl.lemmatize(word, 'v'))
lems_2 = []
for word in lemm_2:
lems_2.append(wnl.lemmatize(word, 'n'))
In [28]:
# prints the 10 most common multi-word pairs (n-grams)
colText_2 = nltk.Text(lems_2)
colText_2.collocations(10)
In [29]:
# bigrams
finder_2 = BigramCollocationFinder.from_words(lems_2, window_size = 2)
# only bigrams that appear 10+ times
finder_2.apply_freq_filter(10)
# only bigrams that contain 'privacy'
finder_2.apply_ngram_filter(privacy_filter)
In [30]:
# filter results based on statistical test
act_2 = finder_2.ngram_fd.items()
raw_2 = finder_2.score_ngrams(bigram_measures.raw_freq)
log_2 = finder_2.score_ngrams(bigram_measures.likelihood_ratio)
In [31]:
##################################################################
################ sorts list of ACTUAL frequencies ################
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, a in act_2:
# first word
prefix_keys[key[0]].append((key[1], a))
# second word
prefix_keys[key[1]].append((key[0], a))
# sort keyed bigrams by strongest association.
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
# remove the word privacy and display the first 50 results
actkeys_2 = prefix_keys['privacy'][:50]
##################################################################
#### sorts list of RAW (expressed as percentage) frequencies #####
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, r in raw_2:
# first word
prefix_keys[key[0]].append((key[1], r))
# second word
prefix_keys[key[1]].append((key[0], r))
# sort keyed bigrams by strongest association.
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
rawkeys_2 = prefix_keys['privacy'][:50]
##################################################################
############### sorts list of log-likelihood scores ##############
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, l in log_2:
# first word
prefix_keys[key[0]].append((key[1], l))
# second word
prefix_keys[key[1]].append((key[0], l))
# sort bigrams by strongest association
for key in prefix_keys:
prefix_keys[key].sort(key = lambda x: -x[1])
logkeys_2 = prefix_keys['privacy'][:50]
In [32]:
from tabulate import tabulate
print(tabulate(logkeys_2, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", numalign="left"))
In [33]:
with open('Allcollocate_Act.csv','w') as f:
w = csv.writer(f)
w.writerows(actkeys_2)
with open('Allcollocate_Raw.csv','w') as f:
w = csv.writer(f)
w.writerows(rawkeys_2)
with open('Allcollocate_Log.csv','w') as f:
w = csv.writer(f)
w.writerows(logkeys_2)
The processed spreadsheet including the cumulative scores for all the bigrams for each test for every year and Parliament can be accessed here: CollocationTable. If you plan to use the data, please cite appropriately.