CRTC Hearing Text Analysis

The purpose of this notebook is to illustrate a method of text analysis using a corpus created from digital content published by the CRTC. This is the second part of a two-part process; the first part describes the code that 'scraped' the CRTC webpage to create the corpus.

Setting Up

The code below imports the modules that are required to process the text.


In [63]:
# importing code modules
import json
import ijson
from ijson import items

import pprint
from tabulate import tabulate

import matplotlib.pyplot as plt

import re
import csv
import sys
import codecs

import nltk
import nltk.collocations
import collections
import statistics
from nltk.metrics.spearman import *
from nltk.collocations import *
from nltk.stem import WordNetLemmatizer


# This is a function for reading the contents of files
def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = codecs.open(filename, 'r', 'utf-8')
    contents = infile.read()
    infile.close()
    return contents

Reading the File

This code reads the necessary files: the JSON file with all the hearing text, and a .txt file with a list of stopwords taken from http://www.lextek.com/manuals/onix/stopwords2.html. I've also added a few custom words.


In [2]:
# loading the JSON file
filename = "../scrapy/hearing_result6.json"

# loading the stopwords file
stopwords = read_file('cornellStopWords.txt')
customStopwords = stopwords.split()
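
The custom additions presumably live in the stopword file itself, but extra words could also be appended in code. Here is a minimal sketch of that approach; the two words below are placeholders for illustration, not the actual custom additions.


In [ ]:
# hedged sketch: appending extra stopwords in code rather than editing the file.
# 'commission' and 'okay' are placeholder examples, not the original custom words.
extraStopwords = ['commission', 'okay']
customStopwords = customStopwords + extraStopwords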

In [3]:
# reads the JSON file and loads each record into a list of Python dictionaries
with open(filename, 'r') as f:
    objects = ijson.items(f, 'item')
    file = list(objects)

A bit of error checking here to confirm the number of records in the file. We should have 14.


In [4]:
# checks to see how many records we have
print(len(file))


14

Changing the number in the code below will print a different record from the file. Remember that list indices begin at 0.


In [ ]:
# not executed here, to keep the GitHub notebook readable.
# prints all content in a single record. Changing the number shows a different record
file[0]

Here is a bit more error checking to confirm the record titles and their URLs.


In [ ]:
# iterates through each record in the file
for row in file:
    # prints the title of each record and its url
    print(row['title'], ":", row['url'])

And a bit more processing to make the text more readable. It's printed below.


In [5]:
# joins the list of text items in each record into a single string, one string per record
joined_text = []
for row in file:
    joined_text.append(' '.join(row['text']))

In [ ]:
# shows the text. Changing the number displays a different record...
# ...changing/removing the second number limits/expands the text shown.
print(joined_text[5][:750])

Text Analysis Processing

This is the beginning of the first processing step for the text analysis. Here we split all the words apart, make them lowercase, and remove the punctuation, numbers, and words on the stopword list.


In [6]:
# splits the text string in each record into a list of separate words
token_joined = []
for words in joined_text:
    # splits the text into a list of words
    text = words.split()
    # lowercases the words and drops any token containing punctuation or numbers
    clean = [w.lower() for w in text if w.isalpha()]
    # applies stopword removal
    text = [w for w in clean if w not in customStopwords]
    token_joined.append(text)

Since one word of interest is guarantee, the commented-out cell below counts that word (and its variations) in each record. The second cell runs the same check for the words service and services, with the counts printed below.


In [ ]:
#for title,word in zip(file,token_joined):
   # print(title['title'],"guarantee:", word.count('guarantee'), "guarantees:", \
       #   word.count('guarantees'), "guaranteed:", word.count('guaranteed'))

In [132]:
for title,word in zip(file,token_joined):
    print(title['title'],"service:", word.count('service'),"services:", word.count('services'))


Transcript, Hearing April 11, 2016  service: 210 services: 103
Transcript, Hearing April 12, 2016  service: 166 services: 74
Transcript, Hearing April 13, 2016  service: 98 services: 49
Transcript, Hearing April 14, 2016  service: 98 services: 31
Transcript, Hearing April 15, 2016  service: 76 services: 31
Transcript, Hearing April 18, 2016  service: 215 services: 79
Transcript, Hearing April 19, 2016  service: 90 services: 46
Transcript, Hearing April 20, 2016  service: 118 services: 60
Transcript, Hearing April 21, 2016  service: 137 services: 52
Transcript, Hearing April 22, 2016  service: 30 services: 30
Transcript, Hearing April 25, 2016  service: 208 services: 60
Transcript, Hearing April 26, 2016  service: 157 services: 60
Transcript, Hearing April 27, 2016  service: 62 services: 28
Transcript, Hearing April 28, 2016  service: 34 services: 20

Concordance

It looks like record number 5 (the April 18 transcript) has the most occurrences of the word service. The code below isolates a single record and creates a concordance for the selected word; the example here uses record 0 (the April 11 transcript).


In [138]:
# splits the text from the selected record (here record 0, the April 11 transcript) into a list of individual words
words = joined_text[0].split()
# assigns NLTK functionality to the text
text = nltk.Text(words)

In [135]:
# prints a concordance output for the selected word (shown in green in the notebook);
# concordance() prints directly, so the trailing 'None' below is just its return value
print(text.concordance('services', lines=25))


Displaying 25 of 79 matches:
tion like YouTube with captioning; services supplied online using sign languag
 sign language, for example, relay services like video relay interpreting and 
want to add that telecommunication services should be recognized as a basic se
 catch up. 7012 Broadband internet services must be defined as a basic service
indication display of what type of services would be offered for the deaf comm
’m unaware of any direct frontline services provided in sign language. But I a
us about getting rid of any of the services we have currently. 7072 Now, maybe
one who works at telecommunication services to understand deaf culture and per
d, so that we can provide the best services for our community going forward. 7
oned earlier that the packages and services available to the deaf community we
herwise, they’ll be unaware of any services being offered. 7100 MR. ROOTS: Jim
 enhance our understanding of what services are available and see which produc
ailable and see which products and services will meet our needs. 7108 And that
ordable and reliable communication services are increasingly essential, for th
ibility of media and communication services by 2020. 7177 The Access 2020 Coal
eliable, fixed and mobile internet services that are essential for participati
excluding broadband from the basic services framework. 7187 We recognize that 
t the affordability and quality of services that are available to them. 7191 E
cess to fixed and mobile broadband services and the need for deploying the wid
ents in improving accessibility of services they deliver or responding to the 
es can access basic communications services that are essential to their abilit
ccess and use basic communications services they need. 7206 Even though incumb
her information and communications services and applications. 7207 We submit t
cerns about affordability of basic services for Canadians with low incomes, as
to and use of basic communications services by Canadians with severe or very s
None

In [140]:
# creates a new file to hold the concordance output
fileconcord = codecs.open('April11_service_concord.txt', 'w', 'utf-8')
# keeps a reference to standard output so it can be restored afterwards
tmpout = sys.stdout
# redirects standard output into the file
sys.stdout = fileconcord
# generates the concordance; the numbers set the line width (in characters) and the maximum number of lines
text.concordance("service", 79, sys.maxsize)
# closes the file
fileconcord.close()
# restores standard output
sys.stdout = tmpout

Below is what the text looks like after the initial processing, without punctuation, numbers, or stopwords.


In [ ]:
# shows the text list for a given record. Changing the first number displays a... 
# ...different record, changing/removing the second number limits/expands the text shown
print(token_joined[5][:50])

Lemmatization

Some more preparation for the text processing. The code below works on all of the records, creating one master list of words, which is then lemmatized (verbs first, then nouns).


In [141]:
# creates a variable for the lemmatizing function
wnl = WordNetLemmatizer()

# lemmatizes all of the verbs
lemm = []
for record in token_joined:
    for word in record:
        lemm.append(wnl.lemmatize(word, 'v'))

'''
lemm = []
for word in token_joined[13]:
    lemm.append(wnl.lemmatize(word, 'v'))
'''

# lemmatizes all of the nouns
lems = []
for word in lemm:
    lems.append(wnl.lemmatize(word, 'n'))

Here we are checking to make sure the lemmatizer has worked. Now the word service appears in only one form (the commented-out check does the same for guarantee).


In [ ]:
# just making sure the lemmatizer has worked
#print("guarantee:", lems.count('guarantee'), "guarantees:", \
         # lems.count('guarantees'), "guaranteed:", lems.count('guaranteed'))

In [142]:
print("service:", lems.count('service'), lems.count('services'))


service: 2435 0

Word Frequency

Here is a count of the number of words in each record. While this data isn't terribly useful as is, it does tell us something about the text: notably, some of the hearings were much longer than others.


In [108]:
# counting the number of words in each record 
for name, each in zip(file,token_joined):
    print(name['title'], ":",len(each), "words")


Transcript, Hearing April 11, 2016  : 16664 words
Transcript, Hearing April 12, 2016  : 12891 words
Transcript, Hearing April 13, 2016  : 8423 words
Transcript, Hearing April 14, 2016  : 8319 words
Transcript, Hearing April 15, 2016  : 4840 words
Transcript, Hearing April 18, 2016  : 13523 words
Transcript, Hearing April 19, 2016  : 10184 words
Transcript, Hearing April 20, 2016  : 8541 words
Transcript, Hearing April 21, 2016  : 12865 words
Transcript, Hearing April 22, 2016  : 3400 words
Transcript, Hearing April 25, 2016  : 14791 words
Transcript, Hearing April 26, 2016  : 11804 words
Transcript, Hearing April 27, 2016  : 7454 words
Transcript, Hearing April 28, 2016  : 6803 words

Here we will count the five most common words in each record.


In [ ]:
docfreq = []
for words in token_joined:
    docfreq.append(nltk.FreqDist(words))

In [ ]:
for name, words in zip(file, docfreq):
    print(name['title'], ":", words.most_common(5))

These are the 10 most significant word pairs (collocations) in the text.


In [143]:
# prints the 10 most significant bigram collocations
colText = nltk.Text(lems)
colText.collocations(10)


service provider; market force; digital literacy; basic service; data
cap; eastern ontario; fix wireless; private sector; low income; rural
remote

Error checking to make sure the code is processing the text properly.


In [115]:
# creates a list of bigrams (ngrams of 2), printing the first 5
colBigrams = list(nltk.ngrams(colText, 2)) 
colBigrams[:5]


Out[115]:
[('hear', 'april'),
 ('april', 'quebec'),
 ('quebec', 'april'),
 ('april', 'copyright'),
 ('copyright', 'reserve')]

More error checking.


In [ ]:
# error checking. There should be one less bigram than total words
print("Number of words:", len(lems))
print("Number of bigrams:", len(colBigrams))

Below is a frequency plot showing the 25 most frequent words.


In [ ]:
# frequency plot with stopwords removed
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 10.0)
fd = nltk.FreqDist(colText)
fd.plot(25)

Collocations

Here we are preparing the text to search for bigrams containing the word of interest, in this case service. With a window size of 2, only adjacent word pairs are considered; the filter then keeps just the bigrams in which the word of interest appears, whether as the first or second word.


In [144]:
# loads bigram code from NLTK
bigram_measures = nltk.collocations.BigramAssocMeasures()
# bigrams with a window size of 2 words
finder = BigramCollocationFinder.from_words(lems, window_size = 2)
# ngrams with 'word of interest' as a member
word_filter = lambda *w: 'service' not in w
# only bigrams that contain the 'word of interest'
finder.apply_ngram_filter(word_filter)

In [145]:
# filter results based on statistical test

# calculates the raw frequency, both as an actual count and as a proportion of total words
act = finder.ngram_fd.items()
raw = finder.score_ngrams(bigram_measures.raw_freq)
# log-likelihood ratio
log = finder.score_ngrams(bigram_measures.likelihood_ratio)

The log-likelihood ratio is generally considered the most reliable statistical test for sparse or low-frequency data.

Log-Likelihood Ratio

The log-likelihood ratio calculates the size and significance of the difference between the observed and expected frequencies of bigrams and assigns a score based on the result, taking into account the overall size of the corpus. The larger the difference between the observed and expected frequencies, the higher the score, and the more statistically significant the collocate is. The log-likelihood ratio is my preferred test for collocates because it does not rely on a normal distribution, and for this reason it can handle sparse or low-frequency bigrams. It does not over-represent low-frequency bigrams with inflated scores, as the test only reports how much more likely it is that the frequencies are different than that they are the same. The drawback to the log-likelihood ratio is that it cannot be used to compare scores across corpora.
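
To make the scoring concrete, here is a minimal sketch (not part of the original analysis) of how a single log-likelihood score is computed from a bigram's contingency counts with NLTK; the counts below are invented purely for illustration.


In [ ]:
# hedged example: scoring one bigram from made-up contingency counts
import nltk

bigram_measures = nltk.collocations.BigramAssocMeasures()

n_ii = 10       # times the two words occur together as a bigram
n_ix = 250      # total occurrences of the first word
n_xi = 180      # total occurrences of the second word
n_xx = 100000   # total number of words in the corpus

print(bigram_measures.likelihood_ratio(n_ii, (n_ix, n_xi), n_xx))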

An important note: words can appear twice in the following list. Since a collocate can occur both before and after the word of interest, care must be taken to identify duplicate occurrences in the list below and then combine the totals (one way of doing this is sketched just below).
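
As a hedged sketch (not from the original notebook), duplicates can be merged by summing the raw co-occurrence counts stored in the finder built above; this assumes the finder was filtered on the word service as in the earlier cell.


In [ ]:
# hedged sketch: merge collocates that appear both before and after the word
# of interest by summing their raw counts from the finder above
import collections

word_of_interest = 'service'
combined = collections.Counter()
for (w1, w2), count in finder.ngram_fd.items():
    # keep only the collocate, dropping the word of interest itself
    collocate = w2 if w1 == word_of_interest else w1
    combined[collocate] += count

print(combined.most_common(10))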


In [118]:
# prints list of results. 
print(tabulate(log, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", \
               numalign="left"))


Collocate                        Log-Likelihood
-------------------------------  ----------------
('digital', 'literacy')          450.282
('literacy', 'skill')            39.179
('literacy', 'train')            12.498
('iterate', 'literacy')          11.203
('literacy', 'critically')       11.203
('literacy', 'nowadays')         11.203
('literacy', 'telecommunities')  11.203
('literacy', 'mean')             9.784
('literacy', 'conceive')         7.399
('literacy', 'throw')            7.399
('literacy', 'express')          6.726
('literacy', 'factor')           6.726
('literacy', 'free')             6.726
('literacy', 'poverty')          6.228
('literacy', 'primary')          5.833
('literacy', 'actual')           5.505
('literacy', 'reflect')          5.505
('literacy', 'option')           5.226
('literacy', 'potential')        4.983
('literacy', 'benefit')          4.767
('literacy', 'enhance')          4.767
('literacy', 'hand')             4.767
('literacy', 'responsibility')   4.767
('literacy', 'involve')          4.574
('literacy', 'extent')           4.400
('literacy', 'job')              4.400
('literacy', 'couple')           3.832
('literacy', 'essential')        3.832
('literacy', 'guess')            3.221
('literacy', 'comment')          2.982
('literacy', 'ensure')           2.982
('literacy', 'increase')         2.982
('literacy', 'country')          2.839
('literacy', 'move')             2.773
('literacy', 'bite')             2.647
('bite', 'literacy')             2.647
('literacy', 'time')             2.530
('literacy', 'build')            2.370
('literacy', 'kind')             1.826
('literacy', 'part')             1.792
('literacy', 'work')             1.164
('fund', 'literacy')             1.144
('literacy', 'issue')            0.930
('literacy', 'digital')          0.854
('literacy', 'make')             0.797
('literacy', 'people')           0.441

In [129]:
# prints list of results. 
print(tabulate(act, headers = ["Collocate", "Actual"], floatfmt=".3f", \
               numalign="left"))


Collocate                       Actual
------------------------------  --------
('literacy', 'responsibility')  1
('ask', 'literacy')             1
('literacy', 'gap')             1
('bridge', 'literacy')          1
('literacy', 'support')         1
('literacy', 'skill')           1
('literacy', 'objective')       1
('literacy', 'piece')           3
('literacy', 'largely')         1
('literacy', 'problem')         2
('literacy', 'digital')         1
('thought', 'literacy')         1
('literacy', 'huge')            1
('digital', 'literacy')         10

In [97]:
with open('digital-literacy_collocate_Act.csv','w') as f:
    w = csv.writer(f)
    w.writerows(act)

Here is an example of a word appearing twice. Below are both instances of the collocate 'quality' for the word guarantee: the first appears before 'guarantee' and the second after.

Collocate                  Log-Likelihood
-------------------------  ----------------
('quality', 'guarantee')   76.826
('guarantee', 'quality')   3.955

A bit more processing to clean up the list.


In [146]:
##################################################################
############### sorts list of log-likelihood scores ##############
##################################################################

# group bigrams by first and second word in bigram                                        
prefix_keys = collections.defaultdict(list)
for key, l in log:
    # first word
    prefix_keys[key[0]].append((key[1], l))
    # second word
    prefix_keys[key[1]].append((key[0], l))
    
# sort bigrams by strongest association                                  
for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

# keeps the top 80 results for the word of interest
logkeys = prefix_keys['service'][:80]

Here is a list showing only the collocates for the word service. Again, watch for duplicate words below.


In [147]:
from tabulate import tabulate
print(tabulate(logkeys, headers = ["Collocate", "Log-Likelihood"], floatfmt=".3f", \
               numalign="left"))


Collocate          Log-Likelihood
-----------------  ----------------
provider           1300.460
basic              1084.240
telecommunication  328.872
internet           267.560
quality            221.506
objective          180.661
provide            164.326
broadband          141.885
telephone          135.467
universal          107.138
social             101.779
minimum            93.220
wireless           90.850
communication      90.108
level              87.269
essential          74.443
telecom            69.281
relay              58.368
quality            55.463
level              47.471
improve            41.829
offer              41.147
include            39.013
rural              37.936
provide            36.162
voice              33.907
deliver            32.591
meg                27.341
agency             27.287
obligation         24.594
high               23.735
discretionary      22.993
offer              22.866
emergency          22.337
satellite          21.426
cell               21.202
product            20.738
cellular           20.349
extend             20.070
improvement        19.911
dsl                18.344
deliver            17.959
delivery           17.797
overbuilt          16.223
telecomm           16.223
migrate            16.122
fund               15.920
canadian           14.736
package            14.663
voip               14.450
definition         13.657
comparable         13.302
hear               12.441
rep                12.439
issue              12.342
area               12.166
purchase           11.911
luxury             11.724
outage             11.724
application        11.496
web                11.414
provision          11.207
bundle             11.192
obtain             10.892
work               10.836
satisfactory       10.747
type               10.657
thing              10.461
fibre              10.412
mko                10.366
directly           10.173
wireline           10.120
subsidy            10.043
canadian           9.799
delivery           9.625
wrap               9.597
point              9.336
enhance            9.282
sort               9.225
extend             9.139

In [148]:
with open('service_collocate_Log.csv','w') as f:
    w = csv.writer(f)
    w.writerows(logkeys)


In [ ]:
# working on a regex to split text by speaker
diced = []
for words in joined_text:
    diced.append(re.split(r'(\d+(\s)\w+[A-Z](\s|.\s)\w+[A-Z]:\s)', words))
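
As a quick, hedged check of the speaker pattern, here it is run on a single invented line that mimics the transcript format seen in the concordance output (paragraph number, then SPEAKER NAME in capitals, then a colon). Only the pattern itself comes from the cell above; the sample text is made up.


In [ ]:
import re

# invented sample line mimicking the transcript format
sample = "7100 MR. ROOTS: Thank you for the opportunity to appear today."
# re.split also returns the capturing groups, so the speaker tag shows up in the output
print(re.split(r'(\d+(\s)\w+[A-Z](\s|.\s)\w+[A-Z]:\s)', sample))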

In [ ]:
print(diced[8])

In [ ]:
init_names = []
for words in joined_text:
    init_names.append(set(re.findall('[A-Z]{3,}', words)))

In [ ]:
print(init_names)

In [ ]:
with open('initialNames.csv','w') as f:
    w = csv.writer(f)
    w.writerows(init_names)