Collocational Analysis

"You shall know a word by the company it keeps!" These are the oft-quoted words of the linguist J.R. Firth in describing the meaning and spirit of collocational analysis. Collocation is a linguistic term for co-occuring. While most words have the possibility of co-occuring with most other words at some point in the English language, when there is a significant statistical relationship between two regularly co-occuring words, we can refer to these as collocates. One of the first, and most cited examples of collocational analysis concerns the words strong and powerful. While both words mean arguably the same thing, it is statistically more common to see the word strong co-occur with the word tea. Native speakers of English can immediately recognize the familiarity of strong tea as opposed to powerful tea, even though the two phrases both make sense in their own way (see Halliday, 1966 for more of this discussion). Interestingly, the same associations do not occur with the phrases strong men and powerful men, although in these instances, both phrases take on slightly different meanings.

These examples highlight the belief of Firthian linguists that the meaning of a word is not confined to the word itself, but in the associations that words have with other co-occuring words. Statistically significant collocates need not be adjacent, just proximal. The patterns of the words in a text, rather than the individual words themselves, have complex, relational units of meaning that allow us to ask questions about the use of lanugage in specific discourses.

In this exercise we will determine the statistical significance of the words that most often co-occur with privacy in an attempt to better understand the meaning of the word as it is used in the Hansard Corpus. We will count the actual frequency of the co-occurence, as well as use a number of different statistical tests of probability. These tests will be conducted first on one file from the corpus, then on the entire corpus itself.

Part 1: Collocational analysis on one file

This section will determine the statistically significant collocates that accompany the word privacy in the file for 2015. Testing file-by-file allows us to track the diachronic (time-based) change and use of the words.

Again, we'll begin by calling on all the functions we will need. Remember that the first few sentences are calling on pre-installed Python modules, and anything with a def at the beginning is a custom function built specifically for these exercises. The text in red describes the purpose of the function.



In [1]:

    
# This is where the modules are imported
import csv
import sys
import codecs
import nltk
import nltk.collocations
import collections
import statistics
from nltk.metrics.spearman import *
from nltk.collocations import *
from nltk.stem import WordNetLemmatizer
from os import listdir
from os.path import splitext
from os.path import basename
from tabulate import tabulate

# These functions iterate through the directory and create a list of filenames

def list_textfiles(directory):
    "Return a list of filenames ending in '.txt'"
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles


def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name


def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name


def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name

# This function works on the contents of the files

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = codecs.open(filename, 'r', 'utf-8')
    contents = infile.read()
    infile.close()
    return contents

Collocational analysis is a frequency-based technique that uses word counts to determine significance. One of the problems with counting word frequencies, as we have seen in other sections, is that the most frequently occuring words in English are function words, like the, of, and and. For this reason, it is neccessary to remove these words in order to obtain meaningful results. In text analysis, these high frequency words are compiled into lists called stopwords. While standard stopword lists are provided by the NLTK module, for the Hansard Corpus it was necessary to remove other kinds of words, like proper nouns (names and place names), and other words common to parliamentary proceedings (like Prime Minister, Speaker, etc.). These words, along with the standard stopwords, can be seen below.

Here, we use our read_file function to read in a text file of custom stopwords, assigning the variable customStopwords. We tokenize the list using the split function and then create a variable called hansardStopwords that incorporates the NLTK stopword list, and adding the words from customStopwords if they don't already occur in the NLTK list.



In [2]:

    
stopwords = read_file('../HansardStopwords.txt')
customStopwords = stopwords.split()

#default stopwords with custom words added
hansardStopwords = nltk.corpus.stopwords.words('english') 
hansardStopwords += customStopwords



In [3]:

    
print(hansardStopwords)









    



['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'a', 'abbott', 'abitibi', 'ablonczy', 'about', 'above', 'acadie', 'accordance', 'according', 'act', 'acting', 'adam', 'addington', 'adler', 'after', 'again', 'against', 'agencies', 'aglukkaq', 'agriculture', 'alan', 'albas', 'albrecht', 'alexandrine', 'alfred', 'algoma', 'alice', 'all', 'allen', 'allison', 'also', 'am', 'ambler', 'ambrose', 'an', 'ancaster', 'and', 'anders', 'anderson', 'andre', 'andrew', 'andrews', 'angus', 'angus:', 'anita', 'ann', 'anthony', 'any', 'arctic', 'are', 'arent', 'argenteuil', 'armstrong', 'art', 'arthur', 'as', 'ashfield', 'ashton', 'aspin', 'asselin', 'assistant', 'at', 'atamanenko', 'aubin', 'aurora', 'avalon', 'aylmer', 'bachand', 'back', 'bagnell', 'baie', 'bains', 'baird', 'barbe', 'barlow', 'barlow', 'barrie', 'barry', 'basques', 'bateman', 'bathurst', 'bay', 'be', 'beauce', 'beaudin', 'because', 'been', 'before', 'being', 'belanger', 'bellavance', 'bellechasse', 'below', 'belt', 'bennett', 'benoit', 'ber', 'bernard', 'bernier', 'betty', 'between', 'bevington', 'bezan', 'biggar', 'bigras', 'bill', 'blackburn', 'blainville', 'blais', 'blake', 'blanchette', 'blaney', 'bloc', 'block', 'board', 'bob', 'bonavista', 'boniface', 'bonsant', 'borduas', 'borg', 'boshcoff', 'both', 'bouchard', 'boucher', 'boughen', 'bourassa', 'bourgeois', 'boutin', 'bq', 'bradley', 'braid', 'brampton', 'breitkreuz', 'breton', 'brian', 'brison', 'british', 'brock', 'brown', 'bruce', 'bruinooge', 'brunelle', 'bruno', 'bryan', 'bryon', 'bulkley', 'bureau', 'burin', 'burnaby', 'but', 'by', 'byrne', 'cadman', 'calandra', 'calgary', 'calkins', 'cambridge', 'can', 'canada', 'canadian', 'canadians', 'cannan', 'cannis', 'cannon', 'cannot', 'canso', 'cant', 'cape', 'cardin', 'cariboo', 'carleton', 'carmichael', 'carol', 'carole', 'caron', 'carrie', 'carrier', 'cartier', 'cash', 'casson', 'catharines', 'cathy', 'cavoukian', 'central', 'centre', 'chambly', 'champlain', 'chantal', 'charlevoix', 'charlie', 'charlottetown', 'charlton', 'charmaine', 'châteauguay', 'chaudiere', 'chicoutimi', 'chief', 'chisu', 'chong', 'chow', 'christian', 'christiane', 'christopherson', 'chuck', 'chungsen', 'churchill', 'chutes', 'clarke', 'claude', 'clement', 'coady', 'coast', 'coderre', 'colchester', 'cole', 'colin', 'colleague', 'columbia', 'comartin', 'commission', 'commissioner', 'commissioners', 'committee', 'commons', 'communities', 'conestoga', 'conservative', 'conservatives', 'constant', 'cooksville', 'coquitlam', 'corneliu', 'costas', 'cote', 'cotler', 'could', 'couldnt', 'country', 'cpc', 'craig', 'creek', 'crockatt', 'crombie', 'crowder', 'crowfoot', 'cullen', 'cumberland', 'cummins', 'cuzner', 'cyr', 'dame', 'damours', 'dan', 'danforth', 'daniel', 'dartmouth', 'daryl', 'dauphin', 'dave', 'davenport', 'david', 'davidson', 'davies', 'day', 'de', 'dean', 'debate', 'debellefeuille', 'dechert', 'deepak', 'del', 'delta', 'demers', 'democratic', 'denis', 'denise', 'dennis', 'des', 'desnoyers', 'devolin', 'dewar', 'dhaliwal', 'dhalla', 'diane', 'did', 'didnt', 'dieppe', 'dion', 'do', 'does', 'doesnt', 'doing', 'dollard', 'don', 'donnelly', 'dont', 'dore', 'dorion', 'dosanjh', 'douglas', 'down', 'doyle', 'dreeshen', 'drummond', 'dryden', 'duceppe', 'dufour', 'duncan', 'dundas', 'during', 'dykstra', 'each', 'east', 'easter', 'eastern', 'ed', 'edmonton', 'edward', 'eeyou', 'eglinski', 'eglinton', 'elgin', 'english', 'epp', 'esquimalt', 'essex', 'etobicoke', 'eve', 'even', 'eyking', 'fabian', 'faille', 'falls', 'fanshawe', 'fantino', 'fast', 'federal', 'few', 'finance', 'findlay', 'finley', 'first', 'fjord', 'flaherty', 'flamborough', 'fletcher', 'folco', 'foote', 'for', 'forward', 'francis', 'freeman', 'from', 'frontenac', 'fry', 'fundy', 'further', 'gagnon', 'galipeau', 'gallant', 'gander', 'garneau', 'garrison', 'garry', 'gatineau', 'gaudet', 'genereux', 'geoff', 'george', 'georges', 'gerald', 'gerard', 'get', 'gill', 'gilles', 'glen', 'glengarry', 'glover', 'go', 'godfrey', 'godin', 'goguen', 'going', 'goldring', 'goodale', 'goodyear', 'gordon', 'gourde', 'government', 'grâce', 'grand', 'gravelle', 'greg', 'grenville', 'grewal', 'groguhe', 'guarnieri', 'guay', 'guelph', 'guergis', 'guildwood', 'guimond', 'guy', 'had', 'hadnt', 'haliburton', 'halifax', 'hall', 'hamilton', 'hanger', 'hants', 'harbour', 'harold', 'harper', 'harris', 'harvey', 'has', 'hasnt', 'hassainia', 'hastings', 'hat', 'haute', 'have', 'havent', 'having', 'hawn', 'hayes', 'he', 'hearn', 'hebert', 'hed', 'hedy', 'hell', 'her', 'here', 'heres', 'heritage', 'hers', 'herself', 'hes', 'hiebert', 'hill', 'hillyer', 'him', 'himself', 'hinton', 'his', 'hoback', 'hochelaga', 'hoeppner', 'holder', 'holland', 'hon', 'house', 'how', 'however', 'hows', 'hsu', 'hubert', 'hughes', 'hull', 'humber', 'hyer', 'i', 'id', 'if', 'ignatieff', 'iles', 'îles', 'ill', 'im', 'in', 'including', 'ind', 'infrastructure', 'interim', 'interlake', 'into', 'irene', 'irwin', 'is', 'island', 'islands', 'isnt', 'it', 'its', 'its', 'itself', 'ive', 'jack', 'jacques', 'jaffer', 'james', 'jay', 'jean', 'jeanne', 'jeff', 'jennifer', 'jennings', 'jim', 'jinny', 'joan', 'joe', 'jogindera', 'john', 'johns', 'joliette', 'jonathan', 'jones', 'josee', 'joseph', 'joy', 'joyce', 'juan', 'judy', 'julian', 'just', 'justin', 'kamp', 'kania', 'kapuskasing', 'karen', 'karygiannis', 'kawartha', 'keddy', 'keith', 'kelly', 'kelowna', 'ken', 'kennedy', 'kenney', 'kenora', 'kent', 'kerr', 'kesteren', 'kevin', 'kildonan', 'kings', 'kingston', 'kingsway', 'kirsty', 'kitchener', 'know', 'komarnicki', 'kramp', 'kyle', 'la', 'labelle', 'labrador', 'lac', 'lachine', 'laforest', 'laframboise', 'lake', 'lakes', 'lalonde', 'lambert', 'lamothe', 'lamoureux', 'lanark', 'last', 'latendresse', 'laurent', 'laurentides', 'laurie', 'laurier', 'laurin', 'lauzon', 'laval', 'lavallee', 'lavar', 'lawrence', 'layton', 'le', 'leader', 'lebel', 'leblanc', 'leduc', 'lee', 'leeds', 'leef', 'lefebvre', 'legislative', 'lemay', 'lemieux', 'lennox', 'leon', 'leona', 'les', 'leslie', 'lessard', 'lets', 'letter', 'leung', 'levesque', 'levis', 'lib', 'libby', 'liberal', 'like', 'lile', 'linda', 'lise', 'liu', 'lizon', 'loan', 'lobb', 'london', 'longueuil', 'lotbiniere', 'louis', 'loyola', 'luc', 'lukiwski', 'lunn', 'lunney', 'lysane', 'macaulay', 'mackay', 'mackenzie', 'mactaquac', 'made', 'make', 'malcolm', 'malhi', 'malo', 'maloway', 'malpeque', 'manicouagan', 'manitoulin', 'manning', 'many', 'maple', 'marc', 'margarets', 'marie', 'mario', 'marjolaine', 'mark', 'markham', 'marlene', 'marquette', 'marston', 'martin', 'masse', 'mastro', 'mathieu', 'mathyssen', 'maurice', 'mauril', 'mayes', 'mccallum', 'mccoleman', 'mcguinty', 'mckay', 'mcleod', 'mcteague', 'me', 'meadows', 'medicine', 'megan', 'meili', 'member', 'members', 'menard', 'mendes', 'menegakis', 'menzies', 'merrifield', 'michael', 'middlesex', 'mike', 'mille', 'miller', 'mills', 'minister', 'minna', 'mirabel', 'mission', 'mississauga', 'mississippi', 'moncton', 'monte', 'montmorency', 'moody', 'moore', 'more', 'morin', 'most', 'motion', 'mount', 'mountain', 'mourani', 'mr', 'ms', 'mulcair', 'murphy', 'murray', 'musquodoboit', 'must', 'mustnt', 'my', 'myron', 'myself', 'nadeau', 'nathan', 'national', 'navdeep', 'ndp', 'need', 'neigette', 'nepean', 'neville', 'new', 'newfoundland', 'newmarket', 'newton', 'nicholson', 'nickel', 'nicolas', 'nicole', 'niki', 'nipissing', 'no', 'nor', 'nord', 'norlock', 'norman', 'north', 'northumberland', 'not', 'notre', 'nova', 'now', 'nunavik', 'oak', 'obhrai', 'oconnor', 'oda', 'of', 'off', 'office', 'oliphant', 'on', 'once', 'one', 'oneill', 'only', 'opitz', 'or', 'orleans', 'oshawa', 'other', 'ottawa', 'ouellet', 'ought', 'our', 'ours', 'ourselves', 'out', 'outremont', 'over', 'own', 'oxford', 'pacetti', 'paille', 'palliser', 'papineau', 'paquette', 'paradis', 'park', 'parliament', 'parliamentary', 'parm', 'part', 'party', 'pascal', 'pat', 'patrick', 'patriotes', 'patry', 'paul', 'pauline', 'payne', 'peace', 'pearl', 'pearson', 'pellan', 'people', 'perkins', 'peter', 'peterborough', 'petit', 'phil', 'picard', 'pierre', 'pierrefonds', 'pitt', 'plamondon', 'poilievre', 'pointe', 'pomerleau', 'pontiac', 'port', 'portneuf', 'prentice', 'prescott', 'president', 'preston', 'prime', 'prince', 'proceedings', 'program', 'proulx', 'public', 'put', 'quadra', 'quappelle', 'quebec', 'question', 'quinte', 'rae', 'rafferty', 'rahim', 'rainy', 'raitt', 'rajotte', 'ralph', 'randall', 'random', 'rankin', 'ratansi', 'rathgeber', 'ravignat', 'ray', 'redman', 'reform', 'regan', 'regina', 'reid', 'repentigny', 'richard', 'richards', 'richardson', 'richmond', 'rick', 'rickford', 'ridge', 'ridges', 'rimouski', 'ritz', 'river', 'riverview', 'riviere', 'rivieres', 'rob', 'robert', 'rod', 'rodger', 'rodney', 'rodriguez', 'rona', 'rosane', 'rosetown', 'rota', 'roxanne', 'royal', 'ruby', 'russell', 'ryan', 's', 'sackville', 'sadia', 'said', 'saint', 'sainte', 'same', 'sana', 'saskatoon', 'sault', 'savage', 'savoie', 'saxton', 'say', 'scarborough', 'scarpaleggia', 'scheer', 'schellenberger', 'scott', 'sea', 'secretary', 'see', 'seeback', 'selkirk', 'sgro', 'shant', 'shawn', 'she', 'shea', 'shed', 'shell', 'shelly', 'sherbrooke', 'sherwood', 'shes', 'shipley', 'shore', 'shory', 'should', 'shouldnt', 'siksay', 'silva', 'simcoe', 'simms', 'sims', 'simson', 'siobhan', 'skeena', 'sky', 'smith', 'so', 'solberg', 'some', 'sopuck', 'sorenson', 'soulanges', 'south', 'southeast', 'southwest', 'speaker', 'st', 'stanton', 'stated', 'statutes', 'ste', 'stella', 'stephen', 'steven', 'stoddart', 'stoffer', 'stoney', 'storseth', 'strahl', 'strathcona', 'such', 'sudbury', 'sukh', 'sunshine', 'support', 'susan', 'swan', 'sweet', 'sydney', 'sylvie', 'szabo', 'take', 'ted', 'temiscamingue', 'temiscouata', 'terrebonne', 'than', 'that', 'thats', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'theres', 'therrien', 'these', 'they', 'theyd', 'theyll', 'theyre', 'theyve', 'thi', 'thibeault', 'thierry', 'think', 'this', 'thomas', 'thompson', 'those', 'though', 'through', 'thunder', 'tilson', 'tim', 'time', 'timiskaming', 'timmins', 'to', 'tobique', 'today', 'todd', 'toews', 'told', 'tom', 'tonks', 'tony', 'too', 'toronto', 'translation', 'treasury', 'tremblay', 'trois', 'trost', 'trottier', 'trudeau', 'truppe', 'tweed', 'two', 'ujjal', 'under', 'unionville', 'until', 'up', 'uppal', 'us', 'valcourt', 'valeriote', 'valley', 'van', 'vancouver', 'vaudreuil', 'vaughan', 'vellacott', 'vercheres', 'verner', 'verte', 'very', 'victoria', 'ville', 'vincent', 'volpe', 'wallace', 'want', 'warawa', 'warkentin', 'was', 'wascana', 'wasnt', 'watson', 'way', 'wayne', 'we', 'wed', 'well', 'welland', 'were', 'were', 'werent', 'west', 'westdale', 'western', 'westlock', 'westminster', 'westmount', 'weston', 'westwood', 'weve', 'what', 'whats', 'when', 'whens', 'where', 'wheres', 'which', 'while', 'whip', 'who', 'whom', 'whos', 'why', 'whys', 'wilfert', 'wilks', 'will', 'williamson', 'willowdale', 'windsor', 'winnipeg', 'with', 'wladyslaw', 'wong', 'wont', 'woodworth', 'would', 'wouldnt', 'wrzesnewskyj', 'year', 'years', 'yelich', 'yellowhead', 'york', 'you', 'youd', 'youll', 'young', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'youve', 'yukon', 'yves', 'yvon', 'yvonne', 'zarac', 'zimmer']

Now, we use read_file to load the contents of the file for 2015. For consistency and to avoid file duplication, we're always reading the files from the same directory. Even though it was used for other sections, the data is the same. We read the contents of the text file, then remove the case and punctuation from the text, split the words into a list of tokens, and assign the words in each file to a list with the variable name text. What's new here, compared to other sections, is the additional removal of stopwords.



In [3]:

    
file = '../Counting Word Frequencies/data/2009.txt'
name = get_filename(file)



In [4]:

    
# opens, reads, and tokenizes the file
text = read_file(file)
words = text.split()
clean = [w.lower() for w in words if w.isalpha()]
# removes stopwords
text = [w for w in clean if w not in hansardStopwords]

Another type of processing required for the generation of accurate collocational statistics is called lemmatization. In linguistics, a lemma is the grammatical base or stem of a word. For example, the word protect is the lemma of the verbs protecting and protected, while ethic is the lemma of the noun ethics. When we lemmatize a text, we are removing the grammatical inflections of the word forms (like ing or ed). The purpose of lemmatization for the Hansard Corpus is to obtain more accurate statistics for collocation by avoiding multiple entries for similar but different word forms (like protecting, protected, and protect). For the purpose of this text analysis, I have decided to lemmatize only the nouns and verbs in the Hansard Corpus, as the word privacy is not easily modified by adjectives (or at all by adverbs).

The lemmatizer I have used for this project was developed by Princeton and is called WordNet. Lemmas and their grammatical inflections can be searched using their web interface.

In the code below, I load the WordNetLemmatizer (another function included in the NLTK module) into the variable wnl. Then, I iterate through the text, first lematizing the verbs (shown as v), then the nouns (shown as n). Unfortunately, the WordNet function only takes one argument, so this code requires two pass-throughs of the text. I'm sure there is a more elegant way to construct this code, though I've not found it yet. This is another reason why I've decided only to lemmatize verbs and nouns, rather than including adjectives and adverbs.



In [5]:

    
# creates a variable for the lemmatizing function
wnl = WordNetLemmatizer()

# lemmatizes all of the verbs
lemm = []
for word in text:
    lemm.append(wnl.lemmatize(word, 'v'))

# lemmatizes all of the nouns 
lems = []
for word in lemm:
    lems.append(wnl.lemmatize(word, 'n'))



In [50]:

    
print("Number of words:", len(lems))









    



Number of words: 9500964

We need to make sure that the lemmatizer did something. Since we've only lemmatized for nouns and verbs, we check that here against the unlemmatized corpus, where text has not been lemmatized and lems has. Below we see that noun ethics appears 156 times in the text variable and 0 times in the lems variable. But the lemma for ethics: ethic, remains in the lems variable for a frequency of 161 times. Similar values are repeated for the verb and variations of protect.



In [7]:

    
print('NOUNS')
print('ethics:', text.count('ethics'))
print('ethics:', lems.count('ethics'))
print('ethic:', lems.count('ethic'))
print('\n')
print('VERBS')
print('protecting:', text.count('protecting'))
print('protecting:', lems.count('protecting'))
print('protected:', text.count('protected'))
print('protected:', lems.count('protected'))
print('protect:', lems.count('protect'))









    



NOUNS
ethics: 156
ethics: 0
ethic: 161


VERBS
protecting: 737
protecting: 0
protected: 268
protected: 0
protect: 3401

Here we check that the lemmatizer hasn't been over-zealous by determining the frequency for privacy before and after the lemmatizing function. The frequencies are the same, meaning we've not lost anything in the lemmatization.



In [8]:

    
print('privacy:', text.count('privacy'))
print('privacy:', lems.count('privacy'))









    



privacy: 806
privacy: 806

Part 1.1: Unfocused Bigram Search

Let's clarify some of the words we will be using in the rest of this exercise:

ngram = catch-all term for multiple word occurences
bigram = word pairs
trigram = three-word phrases

After the stopwords have been removed and the nouns and verbs lemmatized, we are ready to determine statistics for co-occuring words, or collocates. Any collocational test requires four pieces of data: the length of the text in which the words appear, the number of times the words both seperately appear in the text, and the number of times the words occur together.

Before we focus our search on the word privacy, we will determine the 10 most commonly occuring bigrams (based on frequency) in the 2015 Hansard Corpus.

In this code we assign the lems variable to colText by adding the nltk.Text functionality. We can then use the NLTK function collocations to determine (in this case) the 10 most common bigrams. Changing the number in the brackets will change the number of results returned.



In [9]:

    
# prints the 10 most common bigrams
colText = nltk.Text(lems)
colText.collocations(10)









    



small business; child care; supreme court; criminal code; action plan;
employment social; health care; foreign affair; social development;
spinal cord

For reference, I ran an earlier test that shows the 10 most common bigrams without the stopwords removed. Duplicating this test only requires that stopwords not be removed as the text is being tokenized and cleaned. We can see that there is a clear difference in the types of results returned with and without stopwords applied. The list of words appearing above is much more interesting in terms of discourse analysis, when functional parliamentary phrases like Prime Minister and Parliamentary Secretary have been removed.

prime minister; would like; parliamentary secretary; first nation; public safety; british columbia; act speaker; small business; ontario cpc; new democrat

Here is a piece of code that shows how the ngram function works. It goes word by word through the text, pairing each word with the one that came before. That's why the last word in the first word pair becomes the first in the next word pair.

We assign our colText variable to the colBigrams variable by specifiying that we want to make a list of ngrams containing 2 words. We could obtain trigrams by changing the 2 in the first line of code to a 3. Then, in the second line of code, we display the first 5 results of the colBigrams variable with :5. We could display the first 10 by changing the number in the square brackets to :10, or show the top 10 results again by removing the colon.



In [10]:

    
# creates a list of bigrams (ngrams of 2), printing the first 5
colBigrams = list(nltk.ngrams(colText, 2)) 
colBigrams[:5]









    Out[10]:





[('official', 'report'),
 ('report', 'debate'),
 ('debate', 'volume'),
 ('volume', 'number'),
 ('number', 'session')]

Here we will check to make sure we've the bigram function has gone through and counted the entire text. Having one less ngram is correct because of the way in which the ngrams are generated word-by-word in the test above.



In [11]:

    
print("Number of words:", len(lems))
print("Number of bigrams:", len(colBigrams))









    



Number of words: 1361982
Number of bigrams: 1361981

Part 1.2: Focused Bigram Search

In this section we will focus our search on bigrams that contain the word privacy. First, we'll load the bigram tests from the NLTK module, then, we will create a filter that only searches for bigrams containing privacy. To search for bigrams containing other words, the word privacy in the second line of code can be changed to something else.



In [12]:

    
# loads bigram code from NLTK
bigram_measures = nltk.collocations.BigramAssocMeasures()

# ngrams with 'privacy' as a member
privacy_filter = lambda *w: 'privacy' not in w

Next, we will load our lemmatized corpus into the bigram collocation finder, apply a frequency filter that only considers bigrams that appear four or more times, and then apply our privacy filter to the results. The variable finder now contains a list of all the bigrams containing privacy that occur four or more times.



In [34]:

    
# bigrams
finder = BigramCollocationFinder.from_words(lems, window_size = 2)
# only bigrams that appear 4+ times
finder.apply_freq_filter(4)
# only bigrams that contain 'privacy'
finder.apply_ngram_filter(privacy_filter)

Distribution

Before I describe the statistical tests that we will use to determine the collocates for privacy, it is important to briefly discuss distribution. The chart below maps the distribution of the top 25 terms in the 2015 file.

This is important because some of the tests assume a normal distribution of words in the text. A normal distribution means that the majority of the words occur a majority of the time; it is represented in statistics as a bell curve. This means that 68% of the words would occur within one standard deviation of the mean (or average frequency of each word in the text), 95% within two standard deviations, and 99.7 within three standard deviations.

What this means, is that tests that assume a normal distribution will work, but have inaccurate statistics to back them up. I've chosen to describe all of the collocational tests here as a matter of instruction and description, but it's important to understand the tests and what they assume before making research claims based on their results.

The code below calls on the NLTK function FreqDist. The function calculates the frequency of all the words in the variable and charts them in order from highest to lowest. Here I've only requested the first 25, though more or less can be displayed by changing the number in the brackets. Additionally, in order to have the chart displayed inline (and not as a popup), I've called the magic function matplotlib inline. iPython magic functions are identifable by the % symbol.



In [14]:

    
%matplotlib inline
fd = nltk.FreqDist(colText)
fd.plot(25)

As we can see from the chart above, work is the highest frequency word in our lemmatized corpus with stopwords applied, followed by right. The word privacy does not even occur in the list. The code below calculates the frequency and percentage of times these words occur in the text. While work makes up 0.56% of the total words in the text, privacy accounts for only 0.06%.



In [15]:

    
print('privacy:',fd['privacy'], 'times or','{:.2%}'.format(float(colText.count("privacy"))/(len(colText))))
print('right:',fd['right'], 'times or','{:.2%}'.format(float(colText.count("right"))/(len(colText))))
print('work:',fd['work'], 'times or','{:.2%}'.format(float(colText.count("work")/(len(colText)))))









    



privacy: 806 times or 0.06%
right: 5632 times or 0.41%
work: 7588 times or 0.56%

To calculate the mean, and standard deviation, we must count the frequency of all the words in the text and append those values to a list. Since the numbers in the list will actually be represented as text (not as integers), we must add an extra line of code to map those values so they can be used mathematically, calling on the map function.



In [16]:

    
fdnums = []
for sample in fd:
    fdnums.append(fd[sample])
numlist = list(map(int, fdnums))



In [17]:

    
print("Total of unique words:", len(numlist))
print("Total of words that appear only once:", len(fd.hapaxes()))
print("Percentage of words that appear only once:",'{:.2%}'.format(len(fd.hapaxes())/len(numlist)))









    



Total of unique words: 18912
Total of words that appear only once: 5847
Percentage of words that appear only once: 30.92%

Once we have our numbers in a list, as the variable numlist, we can use the built in statistics library for our calculations. Below we've calculated the mean, standard deviation, and the variance.

These numbers prove that the numerical data has a non-normal distribution. The mean is relatively low, compared to the highest frequency word, work, which appears a total of 7588 times.

The low mean is due to the high number of low frequency words; there are 5847 words that appear only once, totalling 30% of the unique words in the entire set. The standard deviation is higher than the mean, which predicts a high variance of numbers in the set, something that is proven by the variance calculation. A large variance shows that the numbers in the set are far apart from the mean, and each other.



In [18]:

    
datamean = statistics.mean(numlist)
print("Mean:", '{:.2f}'.format(statistics.mean(numlist)))
print("Standard Deviation:", '{:.2f}'.format(statistics.pstdev(numlist,datamean)))
print("Variance:", '{:.2f}'.format(statistics.pvariance(numlist,datamean)))









    



Mean: 72.02
Standard Deviation: 300.24
Variance: 90144.28

Statistics

Raw Frequency

The frequency calculations determine both the actual number of occurences of the bigram in the corpus as well as the number of times the bigram occurs relative to the text as a whole (expressed as a percentage).

Student's-T

The Student's T-Score, also called the T-Score, measures the confidence of a claim of collocation and assigns a score based on that certainty. It is computed by subtracting the expected frequency of the bigram by the observed frequency of the bigram, and then dividing the result by the standard deviation which is calculated based on the overall size of the corpus.

The benefit of using the T-Score is that it considers the evidence for collocates based on the overall amount of evidence provided by the size of the corpus. This differs from the PMI score (described below) which only considers strength based on relative frequencies. The drawbacks to the T-Score include its reliance on a normal distribution (due to the incorporation of standard deviation in the calculation), as well as its dependence on the overall size of the corpus. T-scores can't be compared across corpora of different sizes.

Pointwise Mutual Information

The Pointwise Mutual Information Score (known as PMI or MI) measures the strength of a collocation and assigns it a score. It is a probability-based calculation that compares the number of actual bigrams to the expected number of bigrams based on the relative frequency counts of the words. The test compares the expected figure to the observed figure, converting the difference to a number indicating the strength of the collocation.

The benefit of using PMI is that the value of the score is not dependent on the overall size of the corpus, meaning that PMI scores can be compared across corpora of different sizes, unlike the T-score (described above). The drawback to the PMI is that it tends to give high scores to low frequency words when they occur most often in the proximity another word.

Chi-square

The Chi-square (or x²) measures the observed and expected frequencies of bigrams and assigns a score based on the amount of difference between the two using the standard deviation. The Chi-square is another test that relies on a normal distribution.

The Chi-square shares the benefit of the T-score in taking into account the overall size of the corpus. The drawback of the Chi-square is that it doesn't do well with sparse data. This means that low-frequency (but significant) bigrams may not be represented very well, unlike the scores assigned by the PMI.

Log-Likelihood Ratio

The Log-likelihood ratio calculates the size and significance between the observed and expected frequencies of bigrams and assigns a score based on the result, taking into account the overall size of the corpus. The larger the difference between the observed and expected, the higher the score, and the more statistically significant the collocate is.

The Log-likelihood ratio is my preferred test for collocates because it does not rely on a normal distribution, and for this reason, it can account for sparse or low frequency bigrams (unlike the Chi-square). But unlike the PMI, it does not over-represent low frequency bigrams with inflated scores, as the test is only reporting how much more likely it is that the frequencies are different than they are the same. The drawback to the Log-likelihood ratio, much like the t-score, is that it cannot be used to compare scores across corpora.

The following code filters the results of the focused bigram search based on the statistical tests as described above, assigning the results to a new variable based on the test.



In [35]:

    
# filter results based on statistical test

# calulates the raw frequency as an actual number and percentage of total words
act = finder.ngram_fd.items()
raw = finder.score_ngrams(bigram_measures.raw_freq)
# student's - t score
tm = finder.score_ngrams(bigram_measures.student_t)
# pointwise mutual information score
pm = finder.score_ngrams(bigram_measures.pmi)
# chi-square score
ch = finder.score_ngrams(bigram_measures.chi_sq)
# log-likelihood ratio
log = finder.score_ngrams(bigram_measures.likelihood_ratio)

Below are the results for the Log-likelihood test. The bigrams are sorted in order of significance, and the order of the words in the word-pairs shows their placement in the text. This means that the most significant bigram in the Log-likelihood test contained the words digital privacy, in that order. The word digital appears later on in the list with a lower score when it occurs after the word privacy. Scores above 3.8 are considered to be significant for the Log-likelihood test.



In [36]:

    
print(log)









    



[(('digital', 'privacy'), 396.5081168717076), (('privacy', 'privacy'), 126.69320774233589), (('protect', 'privacy'), 65.8741985771311), (('information', 'privacy'), 49.65685937440095), (('privacy', 'ethic'), 43.82420230452009), (('privacy', 'digital'), 43.15111821847787), (('privacy', 'personal'), 38.10615016677362), (('privacy', 'organization'), 37.87867835587883), (('access', 'privacy'), 37.36691254483231), (('privacy', 'information'), 34.514495549437896), (('safeguard', 'privacy'), 33.068170557692035), (('privacy', 'protection'), 32.62222285636642), (('privacy', 'right'), 27.238959186699113), (('privacy', 'dusseault'), 27.203870989981993), (('breach', 'privacy'), 25.433231494351276), (('data', 'privacy'), 24.288574765460197), (('online', 'privacy'), 22.269260561092445), (('expectation', 'privacy'), 21.378038093619825), (('wai', 'privacy'), 18.86321770455099), (('privacy', 'patricia'), 18.232641258191514), (('personal', 'privacy'), 17.801249315765425), (('yurdiga', 'privacy'), 16.70277483144809), (('protection', 'privacy'), 14.774680808312835), (('respect', 'privacy'), 13.519030851823137), (('privacy', 'breach'), 13.400756601365199), (('privacy', 'express'), 12.47336477849932), (('report', 'privacy'), 11.707374865672493), (('strengthen', 'privacy'), 10.67289730252202), (('right', 'privacy'), 10.22579815618001), (('sector', 'privacy'), 9.878751820658572), (('privacy', 'power'), 9.685090628819804), (('individual', 'privacy'), 9.624692452061522), (('privacy', 'appear'), 9.39134832242008), (('private', 'privacy'), 8.778602318086637), (('balance', 'privacy'), 8.514290521370409), (('privacy', 'protect'), 8.365902478261175), (('review', 'privacy'), 8.282541333455605), (('share', 'privacy'), 7.525222263793888), (('power', 'privacy'), 7.265969113438704), (('privacy', 'private'), 6.5141079759135865), (('privacy', 'allow'), 6.046243305876631), (('privacy', 'set'), 5.606891511046996), (('privacy', 'concern'), 5.517852785860679), (('privacy', 'require'), 5.362455501407316), (('privacy', 'rule'), 4.884692779396091), (('privacy', 'amendment'), 4.877203220726191), (('serious', 'privacy'), 4.805000532669893), (('law', 'privacy'), 4.737437439780342), (('privacy', 'law'), 4.737437439780342), (('concern', 'privacy'), 4.721989985307719), (('legislation', 'privacy'), 4.496542749600849), (('stand', 'privacy'), 4.483254772787875), (('privacy', 'individual'), 4.455309514875408), (('privacy', 'important'), 4.361730293302768), (('privacy', 'access'), 4.088783508980829), (('privacy', 'share'), 3.81086056962609), (('privacy', 'propose'), 3.769572074256775), (('privacy', 'company'), 3.67036362097937), (('privacy', 'security'), 3.6317738718551524), (('privacy', 'need'), 3.249153293418055), (('amendment', 'privacy'), 3.1915420047288983), (('privacy', 'respect'), 3.174166497015441), (('give', 'privacy'), 2.911086580344641), (('order', 'privacy'), 2.7095009799863465), (('privacy', 'report'), 2.6313872029787353), (('speak', 'privacy'), 2.410272209175763), (('privacy', 'whether'), 2.2641109390982868), (('privacy', 'provide'), 2.2422187425755373), (('privacy', 'business'), 1.9337383321180868), (('security', 'privacy'), 1.9165636561221167), (('privacy', 'legislation'), 1.7904667913985945), (('privacy', 'number'), 1.7661282687457769), (('ensure', 'privacy'), 1.6695647786639718), (('change', 'privacy'), 1.5329316469407526), (('privacy', 'change'), 1.5329316469407526), (('privacy', 'believe'), 1.512889704482638), (('hear', 'privacy'), 1.245451482664171), (('privacy', 'give'), 1.1724626117078518), (('important', 'privacy'), 1.0644158510424404), (('issue', 'privacy'), 0.9188763679430347), (('privacy', 'issue'), 0.9188763679430347), (('privacy', 'come'), 0.5353546988159024), (('privacy', 'stand'), 0.5118751559489292), (('privacy', 'take'), 0.22083394913622412), (('privacy', 'ensure'), 0.16713544313608786), (('come', 'privacy'), 0.001533666954220969)]

Let's display this data as a table, and remove some of the extra decimal digits. Using the tabulate module, we call the variable log, set the table heading names (displayed in red), and set the number of decimal digits to 3 (indicated by floatfmt=".3f"), with the numbers aligned on the leftmost digit.



In [37]:

    
print(tabulate(log, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", numalign="left"))









    



Collocate                    Log-Likelihood
---------------------------  ----------------
('digital', 'privacy')       396.508
('privacy', 'privacy')       126.693
('protect', 'privacy')       65.874
('information', 'privacy')   49.657
('privacy', 'ethic')         43.824
('privacy', 'digital')       43.151
('privacy', 'personal')      38.106
('privacy', 'organization')  37.879
('access', 'privacy')        37.367
('privacy', 'information')   34.514
('safeguard', 'privacy')     33.068
('privacy', 'protection')    32.622
('privacy', 'right')         27.239
('privacy', 'dusseault')     27.204
('breach', 'privacy')        25.433
('data', 'privacy')          24.289
('online', 'privacy')        22.269
('expectation', 'privacy')   21.378
('wai', 'privacy')           18.863
('privacy', 'patricia')      18.233
('personal', 'privacy')      17.801
('yurdiga', 'privacy')       16.703
('protection', 'privacy')    14.775
('respect', 'privacy')       13.519
('privacy', 'breach')        13.401
('privacy', 'express')       12.473
('report', 'privacy')        11.707
('strengthen', 'privacy')    10.673
('right', 'privacy')         10.226
('sector', 'privacy')        9.879
('privacy', 'power')         9.685
('individual', 'privacy')    9.625
('privacy', 'appear')        9.391
('private', 'privacy')       8.779
('balance', 'privacy')       8.514
('privacy', 'protect')       8.366
('review', 'privacy')        8.283
('share', 'privacy')         7.525
('power', 'privacy')         7.266
('privacy', 'private')       6.514
('privacy', 'allow')         6.046
('privacy', 'set')           5.607
('privacy', 'concern')       5.518
('privacy', 'require')       5.362
('privacy', 'rule')          4.885
('privacy', 'amendment')     4.877
('serious', 'privacy')       4.805
('law', 'privacy')           4.737
('privacy', 'law')           4.737
('concern', 'privacy')       4.722
('legislation', 'privacy')   4.497
('stand', 'privacy')         4.483
('privacy', 'individual')    4.455
('privacy', 'important')     4.362
('privacy', 'access')        4.089
('privacy', 'share')         3.811
('privacy', 'propose')       3.770
('privacy', 'company')       3.670
('privacy', 'security')      3.632
('privacy', 'need')          3.249
('amendment', 'privacy')     3.192
('privacy', 'respect')       3.174
('give', 'privacy')          2.911
('order', 'privacy')         2.710
('privacy', 'report')        2.631
('speak', 'privacy')         2.410
('privacy', 'whether')       2.264
('privacy', 'provide')       2.242
('privacy', 'business')      1.934
('security', 'privacy')      1.917
('privacy', 'legislation')   1.790
('privacy', 'number')        1.766
('ensure', 'privacy')        1.670
('change', 'privacy')        1.533
('privacy', 'change')        1.533
('privacy', 'believe')       1.513
('hear', 'privacy')          1.245
('privacy', 'give')          1.172
('important', 'privacy')     1.064
('issue', 'privacy')         0.919
('privacy', 'issue')         0.919
('privacy', 'come')          0.535
('privacy', 'stand')         0.512
('privacy', 'take')          0.221
('privacy', 'ensure')        0.167
('come', 'privacy')          0.002

Here we print the results of this table to a CSV file.



In [22]:

    
with open(name + 'CompleteLog.csv','w') as f:
    w = csv.writer(f)
    w.writerows(log)

While the table above is nice, it isn't formated exactly the way it could be, especially since we already know that privacy is one half of the bigram. I want to format the list so I can do some further processing in some spreadsheet software, including combining the scores of the bigrams (like digital privacy and privacy digital) so I can have one score for each word.

The code below sorts the lists generated by each test by the first word in the bigram, appending them to a dictionary called prefix_keys, where each word is a key and the score is the value. Then, we sort the keys by the value with the highest score, and assign the new list to a new variable with the word privacy removed. This code must be repeated for each test.

For the purposes of this analysis, we will only output the two frequency tests and the Log-likelihood test.



In [38]:

    
##################################################################
################ sorts list of ACTUAL frequencies ################
##################################################################

# group bigrams by first and second word in bigram                                        
prefix_keys = collections.defaultdict(list)
for key, a in act:
    # first word
    prefix_keys[key[0]].append((key[1], a))
     # second word
    prefix_keys[key[1]].append((key[0], a))
    
# sort keyed bigrams by strongest association.                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

# remove the word privacy and display the first 50 results
actkeys = prefix_keys['privacy'][:50]

##################################################################
#### sorts list of RAW (expressed as percentage) frequencies #####
##################################################################

# group bigrams by first and second word in bigram                                         
prefix_keys = collections.defaultdict(list)
for key, r in raw:
    # first word
    prefix_keys[key[0]].append((key[1], r))
    # second word
    prefix_keys[key[1]].append((key[0], r))
    
# sort keyed bigrams by strongest association.                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

rawkeys = prefix_keys['privacy'][:50]

##################################################################
############### sorts list of log-likelihood scores ##############
##################################################################

# group bigrams by first and second word in bigram                                        
prefix_keys = collections.defaultdict(list)
for key, l in log:
    # first word
    prefix_keys[key[0]].append((key[1], l))
    # second word
    prefix_keys[key[1]].append((key[0], l))
    
# sort bigrams by strongest association                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

logkeys = prefix_keys['privacy'][:50]

Let's take a look at the new list of scores for the Log-likelihood test, with the word privacy removed. Nothing has changed here except the formatting.



In [41]:

    
from tabulate import tabulate
print(tabulate(logkeys, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", numalign="left"))









    



Collocate     Log-Likelihood
------------  ----------------
digital       396.508
privacy       126.693
privacy       126.693
protect       65.874
information   49.657
ethic         43.824
digital       43.151
personal      38.106
organization  37.879
access        37.367
information   34.514
safeguard     33.068
protection    32.622
right         27.239
dusseault     27.204
breach        25.433
data          24.289
online        22.269
expectation   21.378
wai           18.863
patricia      18.233
personal      17.801
yurdiga       16.703
protection    14.775
respect       13.519
breach        13.401
express       12.473
report        11.707
strengthen    10.673
right         10.226
sector        9.879
power         9.685
individual    9.625
appear        9.391
private       8.779
balance       8.514
protect       8.366
review        8.283
share         7.525
power         7.266
private       6.514
allow         6.046
set           5.607
concern       5.518
require       5.362
rule          4.885
amendment     4.877
serious       4.805
law           4.737
law           4.737

Again, just for reference, these are the 25 top Log-Likelhood scores for 2015 without the stopwords applied.

Collocate Log-Likelihood ------------ ---------------- commissioner 2526.639692684 digital 1645.943598842 the 778.568954598 act 600.684800536 and 117.297111827 right 107.573441279 protection 94.322301426 their 62.085308377 protect 58.045722275 online 57.526177072 sector 51.091917412 interim 40.912165767 to 36.187103272 law 30.168200789 the 29.228587343 of 27.547037820 which 22.595494211 practice 21.302055792 daniel 20.643040763 strengthen 18.703927899 expert 18.313791177 a 18.065369736 legislation 14.152985001 policy 12.467935832 current 12.367826207

Here we will write the sorted results of the tests to a CSV file.



In [25]:

    
with open(name + 'collocate_Act.csv','w') as f:
    w = csv.writer(f)
    w.writerows(actkeys)

with open(name + 'collocate_Raw.csv','w') as f:
    w = csv.writer(f)
    w.writerows(rawkeys)
    
with open(name + 'collocate_Log.csv','w') as f:
    w = csv.writer(f)
    w.writerows(logkeys)

What is immediately apparent from the Log-likelihood scores is that there are distinct types of words that co-occur with the word privacy. The top 10 most frequently co-occuring words are digital, protect, ethic, access, right, protection, expectation, and information. Based on this list alone, we can deduce that privacy in the Hansard corpus is a serious topic; one that is concerned with ethics and rights, which are things commonly associated with the law. We can also see that privacy has both a digital and an informational aspect, which are things that have an expectation of both access and protection.

While it may seem obvious that these kinds of words would co-occur with privacy, we now have statistical evidence upon which to build our claim.

Part 2: Reading the whole corpus

Here we repeat the above code, only instead of using one file, we will combine all of the files to obtain the scores for the entire corpus.



In [26]:

    
corpus = []
for filename in list_textfiles('../Counting Word Frequencies/data2'):
    text_2 = read_file(filename)
    words_2 = text_2.split()
    clean_2 = [w.lower() for w in words_2 if w.isalpha()]
    text_2 = [w for w in clean_2 if w not in hansardStopwords]
    corpus.append(text_2)



In [27]:

    
lemm_2 = []
for doc in corpus:
    for word in doc:
        lemm_2.append(wnl.lemmatize(word, 'v'))
lems_2 = []
for word in lemm_2:
    lems_2.append(wnl.lemmatize(word, 'n'))



In [28]:

    
# prints the 10 most common multi-word pairs (n-grams)
colText_2 = nltk.Text(lems_2)
colText_2.collocations(10)









    



employment insurance; free trade; human right; criminal code; unite
state; foreign affair; greenhouse gas; supreme court; economic
development; health care



In [29]:

    
# bigrams
finder_2 = BigramCollocationFinder.from_words(lems_2, window_size = 2)
# only bigrams that appear 10+ times
finder_2.apply_freq_filter(10)
# only bigrams that contain 'privacy'
finder_2.apply_ngram_filter(privacy_filter)



In [30]:

    
# filter results based on statistical test
act_2 = finder_2.ngram_fd.items()
raw_2 = finder_2.score_ngrams(bigram_measures.raw_freq)
log_2 = finder_2.score_ngrams(bigram_measures.likelihood_ratio)



In [31]:

    
##################################################################
################ sorts list of ACTUAL frequencies ################
##################################################################

# group bigrams by first and second word in bigram                                        
prefix_keys = collections.defaultdict(list)
for key, a in act_2:
    # first word
    prefix_keys[key[0]].append((key[1], a))
     # second word
    prefix_keys[key[1]].append((key[0], a))
    
# sort keyed bigrams by strongest association.                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

# remove the word privacy and display the first 50 results
actkeys_2 = prefix_keys['privacy'][:50]

##################################################################
#### sorts list of RAW (expressed as percentage) frequencies #####
##################################################################

# group bigrams by first and second word in bigram                                         
prefix_keys = collections.defaultdict(list)
for key, r in raw_2:
    # first word
    prefix_keys[key[0]].append((key[1], r))
    # second word
    prefix_keys[key[1]].append((key[0], r))
    
# sort keyed bigrams by strongest association.                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

rawkeys_2 = prefix_keys['privacy'][:50]

##################################################################
############### sorts list of log-likelihood scores ##############
##################################################################

# group bigrams by first and second word in bigram                                        
prefix_keys = collections.defaultdict(list)
for key, l in log_2:
    # first word
    prefix_keys[key[0]].append((key[1], l))
    # second word
    prefix_keys[key[1]].append((key[0], l))
    
# sort bigrams by strongest association                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

logkeys_2 = prefix_keys['privacy'][:50]



In [32]:

    
from tabulate import tabulate
print(tabulate(logkeys_2, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", numalign="left"))









    



Collocate        Log-Likelihood
---------------  ----------------
access           5110.393
ethic            5107.996
digital          1920.924
protect          1839.608
right            1229.061
information      1062.382
invasion         904.061
privacy          553.362
privacy          553.362
protection       476.078
breach           444.387
breach           376.297
respect          350.245
right            349.423
personal         345.511
violation        313.199
violate          306.068
protection       275.448
personal         253.184
invade           246.280
expectation      237.897
concern          215.959
issue            207.534
law              204.353
information      192.077
ensure           190.866
concern          174.224
online           135.406
competition      129.990
intrude          110.100
civil            108.547
csec             100.214
balance          98.426
confidentiality  90.763
digital          90.598
safeguard        89.725
intrusion        84.955
routine          78.433
access           75.575
report           73.453
individual       69.919
protect          67.792
security         64.219
infringe         63.697
expert           62.525
honour           62.467
interest         61.506
compromise       60.703
liberty          56.292
issue            56.036



In [33]:

    
with open('Allcollocate_Act.csv','w') as f:
    w = csv.writer(f)
    w.writerows(actkeys_2)

with open('Allcollocate_Raw.csv','w') as f:
    w = csv.writer(f)
    w.writerows(rawkeys_2)
    
with open('Allcollocate_Log.csv','w') as f:
    w = csv.writer(f)
    w.writerows(logkeys_2)

The processed spreadsheet including the cumulative scores for all the bigrams for each test for every year and Parliament can be accessed here: CollocationTable. If you plan to use the data, please cite appropriately.