The purpose of this notebook is to illustrate a method of text analysis using a corpus created from digital content published by the CRTC. This is the second part of a two-part process; the first part describes the code that 'scraped' the CRTC webpage to create the corpus.
The code below imports the modules that are required to process the text.
In [63]:
# importing code modules
import json
import ijson
from ijson import items
import pprint
from tabulate import tabulate
import matplotlib.pyplot as plt
import re
import csv
import sys
import codecs
import nltk
import nltk.collocations
import collections
import statistics
from nltk.metrics.spearman import *
from nltk.collocations import *
from nltk.stem import WordNetLemmatizer
# This is a function for reading the contents of files
def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = codecs.open(filename, 'r', 'utf-8')
    contents = infile.read()
    infile.close()
    return contents
This code loads and then reads the necessary files: the JSON file with all the hearing text, and a txt file with a list of stopwords taken from http://www.lextek.com/manuals/onix/stopwords2.html. I've also added a few custom words.
In [2]:
# loading the JSON file
filename = "../scrapy/hearing_result6.json"
# loading the stopwords file
stopwords = read_file('cornellStopWords.txt')
customStopwords = stopwords.split()
In [3]:
# reads the file and assigns the keys and values to a Python dictionary structure
with open(filename, 'r') as f:
    objects = ijson.items(f, 'item')
    file = list(objects)
A bit of error checking here to confirm the number of records in the file. We should have 14.
In [4]:
# checks to see how many records we have
print(len(file))
Changing the number in the code below will print a different record from the file. Please remember that in coding, numbered lists begin at 0.
In [ ]:
# commenting this out to make the github notebook more readable.
# prints all content in a single record. Changing the number shows a different record
file[0]
Here is a bit more error checking to confirm the record titles and their URLs.
In [ ]:
# iterates through each record in the file
for row in file:
    # prints the title of each record and its url
    print(row['title'], ":", row['url'])
And a bit more processing to make the text more readable. It's printed below.
In [5]:
# joins the text items in each record into a single string (rather than a list of pieces)
joined_text = []
for row in file:
    joined_text.append(' '.join(row['text']))
In [ ]:
# shows the text. Changing the number displays a different record...
# ...changing/removing the second number limits/expands the text shown.
print(joined_text[5][:750])
In [6]:
# splits the text string in each record into a list of separate words
token_joined = []
for words in joined_text:
    # splits the text into a list of words
    text = words.split()
    # keeps only alphabetic words (dropping numbers and punctuation) and makes them lowercase
    clean = [w.lower() for w in text if w.isalpha()]
    # applies stopword removal
    text = [w for w in clean if w not in customStopwords]
    token_joined.append(text)
Since one word of interest is 'guarantee', here is a count of how many times that word (and its variations) appears in each record; the active cell below runs the same count for 'service' and 'services', while the commented-out code counts the 'guarantee' variants.
In [ ]:
#for title,word in zip(file,token_joined):
# print(title['title'],"guarantee:", word.count('guarantee'), "guarantees:", \
# word.count('guarantees'), "guaranteed:", word.count('guaranteed'))
In [132]:
for title,word in zip(file,token_joined):
    print(title['title'],"service:", word.count('service'),"services:", word.count('services'))
In [138]:
# splits the text from the record into a list of individual words
words = joined_text[0].split()
#assigns NLTK functionality to the text
text = nltk.Text(words)
In [135]:
# prints a concordance output for the selected word (shown in green)
text.concordance('services', lines=25)
In [140]:
# creates a new file to hold the concordance output
fileconcord = codecs.open('April11_service_concord.txt', 'w', 'utf-8')
# keeps a reference to the normal standard output so it can be restored later
tmpout = sys.stdout
# redirects standard output to the file
sys.stdout = fileconcord
# generates and prints the concordance; the first number is the line width in characters,
# the second is the maximum number of lines to show
text.concordance("service", 79, sys.maxsize)
# closes the file
fileconcord.close()
# restores standard output
sys.stdout = tmpout
Below is what the text looks like after the initial processing, without punctuation, numbers, or stopwords.
In [ ]:
# shows the text list for a given record. Changing the first number displays a...
# ...different record, changing/removing the second number limits/expands the text shown
print(token_joined[5][:50])
In [141]:
# creates a variable for the lemmatizing function
wnl = WordNetLemmatizer()
# lemmatizes all of the verbs
lemm = []
for record in token_joined:
    for word in record:
        lemm.append(wnl.lemmatize(word, 'v'))
'''
lemm = []
for word in token_joined[13]:
    lemm.append(wnl.lemmatize(word, 'v'))
'''
# lemmatizes all of the nouns
lems = []
for word in lemm:
    lems.append(wnl.lemmatize(word, 'n'))
Here we check to make sure the lemmatizer has worked: the word 'guarantee' (and likewise 'service') should now appear in only one form.
In [ ]:
# just making sure the lemmatizer has worked
#print("guarantee:", lems.count('guarantee'), "guarantees:", \
# lems.count('guarantees'), "guaranteed:", lems.count('guaranteed'))
In [142]:
print("service:", lems.count('service'), lems.count('services'))
In [108]:
# counting the number of words in each record
for name, each in zip(file,token_joined):
    print(name['title'], ":", len(each), "words")
Here we will count the five most common words in each record.
In [ ]:
docfreq = []
for words in token_joined:
    docfreq.append(nltk.FreqDist(words))
In [ ]:
for name, words in zip(file, docfreq):
    print(name['title'], ":", words.most_common(5))
These are the 10 most common word pairs in the text.
In [143]:
# prints the 10 most common bigrams
colText = nltk.Text(lems)
colText.collocations(10)
Error checking to make sure the code is processing the text properly.
In [115]:
# creates a list of bigrams (ngrams of 2), printing the first 5
colBigrams = list(nltk.ngrams(colText, 2))
colBigrams[:5]
More error checking.
In [ ]:
# error checking. There should be one less bigram than total words
print("Number of words:", len(lems))
print("Number of bigrams:", len(colBigrams))
Below is a frequency plot showing the occurrence of the 25 most frequent words.
In [ ]:
# frequency plot with stopwords removed
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 10.0)
fd = nltk.FreqDist(colText)
fd.plot(25)
In [144]:
# loads bigram code from NLTK
bigram_measures = nltk.collocations.BigramAssocMeasures()
# bigrams with a window size of 2 words
finder = BigramCollocationFinder.from_words(lems, window_size = 2)
# ngrams with 'word of interest' as a member
word_filter = lambda *w: 'service' not in w
# only bigrams that contain the 'word of interest'
finder.apply_ngram_filter(word_filter)
In [145]:
# filter results based on statistical test
# calculates the raw frequency as an actual number and as a percentage of total words
act = finder.ngram_fd.items()
raw = finder.score_ngrams(bigram_measures.raw_freq)
# log-likelihood ratio
log = finder.score_ngrams(bigram_measures.likelihood_ratio)
Research suggests that this is the most reliable statistical test for sparse or low-frequency data.
Log-Likelihood Ratio
The Log-likelihood ratio calculates the size and significance of the difference between the observed and expected frequencies of bigrams and assigns a score based on the result, taking into account the overall size of the corpus. The larger the difference between the observed and expected frequencies, the higher the score, and the more statistically significant the collocate is. The Log-likelihood ratio is my preferred test for collocates because it does not rely on the assumption of a normal distribution, and for this reason it can accommodate sparse or low-frequency bigrams. It does not over-represent low-frequency bigrams with inflated scores, as the test only reports how much more likely it is that the frequencies are different than that they are the same. The drawback of the Log-likelihood ratio is that its scores cannot be compared across corpora.
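For intuition, here is a rough sketch of how this statistic can be computed by hand for a single bigram from its 2x2 contingency table. This is only an illustration (the function name and counts are made up for the example); in the workflow above the scoring is handled by NLTK's bigram_measures.likelihood_ratio.
In [ ]:
import math

def llr_sketch(n_ii, n_ix, n_xi, n_xx):
    # illustrative only: the G-squared / log-likelihood ratio statistic for one bigram
    #   n_ii = count of the bigram itself (word1 followed by word2)
    #   n_ix = count of all bigrams whose first word is word1
    #   n_xi = count of all bigrams whose second word is word2
    #   n_xx = total number of bigrams in the corpus
    observed = [n_ii,
                n_ix - n_ii,
                n_xi - n_ii,
                n_xx - n_ix - n_xi + n_ii]
    # expected counts if the two words occurred independently of each other
    expected = [n_ix * n_xi / n_xx,
                n_ix * (n_xx - n_xi) / n_xx,
                (n_xx - n_ix) * n_xi / n_xx,
                (n_xx - n_ix) * (n_xx - n_xi) / n_xx]
    # the further the observed counts sit from the expected counts, the larger the score
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)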
An important note: words may appear twice in the following list. Because an ngram can appear both before and after the word of interest, care must be taken to identify duplicate occurrences in the list below and then combine the totals (one possible approach is sketched further down).
In [118]:
# prints list of results.
print(tabulate(log, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", \
numalign="left"))
In [129]:
# prints list of results.
print(tabulate(act, headers = ["Collocate", "Actual"], floatfmt=".3f", \
numalign="left"))
In [97]:
with open('digital-literacy_collocate_Act.csv','w') as f:
    w = csv.writer(f)
    w.writerows(act)
Here is an example of words appearing twice: below are both instances of the ngram 'quality'. The first instance appears before the word of interest and the second occurs after.
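One way to combine the two counts for a collocate like this is to keep whichever member of each bigram is not the word of interest and sum the totals. Below is a minimal sketch of that idea (the helper name combine_collocate_counts is hypothetical, and it assumes the raw-frequency pairs stored in act above).
In [ ]:
from collections import Counter

def combine_collocate_counts(bigram_counts, word_of_interest='service'):
    # hypothetical helper: sums the before/after counts for each collocate
    totals = Counter()
    for (w1, w2), count in bigram_counts:
        # keep whichever member of the bigram is not the word of interest
        collocate = w2 if w1 == word_of_interest else w1
        totals[collocate] += count
    return totals

# e.g. combine_collocate_counts(act).most_common(10)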
A bit more processing to clean up the list.
In [146]:
##################################################################
############### sorts list of log-likelihood scores ##############
##################################################################
# group bigrams by first and second word in bigram
prefix_keys = collections.defaultdict(list)
for key, l in log:
    # first word
    prefix_keys[key[0]].append((key[1], l))
    # second word
    prefix_keys[key[1]].append((key[0], l))
# sort each group of collocates by strongest association
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])
# keeps the top 80 results for the word of interest
logkeys = prefix_keys['service'][:80]
Here is a list showing only the collocates for the word 'service'. Again, watch for duplicate words below.
In [147]:
from tabulate import tabulate
print(tabulate(logkeys, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", \
numalign="left"))
In [148]:
with open('service_collocate_Log.csv','w') as f:
    w = csv.writer(f)
    w.writerows(logkeys)
In [ ]:
# working on a regex to split text by speaker
# (the pattern looks for a paragraph number followed by an upper-case speaker name ending in a colon)
diced = []
for words in joined_text:
    diced.append(re.split(r'(\d+(\s)\w+[A-Z](\s|.\s)\w+[A-Z]:\s)', words))
In [ ]:
print(diced[8])
In [ ]:
init_names = []
for words in joined_text:
    init_names.append(set(re.findall('[A-Z]{3,}', words)))
In [ ]:
print(init_names)
In [ ]:
with open('initialNames.csv','w') as f:
    w = csv.writer(f)
    w.writerows(init_names)