Concordance Output

A concordance is a method of text analysis that is somewhat similar to generating word frequency statistics, except that the search is expanded to the words that appear on either side of the word under investigation. We call the main search word the 'node' and the words surrounding it the 'span'. A concordance is simply a printed list displaying the sentences or 'context' that the node word appears in. This list is traditionally organized in a 'Key Word in Context' (KWIC) format, which places the node word in the centre of the page. The span can be adjusted, but generally includes about five words to the left and five words to the right of the node.
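To make the node and span idea concrete, here is a minimal sketch of a KWIC listing in plain Python. It is not how NLTK builds its concordance; the function name kwic, the punctuation handling and the sample sentence are all assumptions made for illustration.

```python
def kwic(words, node, span=5):
    """Return (left, node, right) tuples for every occurrence of node.

    Matching is case-insensitive and ignores surrounding punctuation;
    both choices are assumptions made for this sketch.
    """
    lines = []
    for i, word in enumerate(words):
        if word.strip('.,;:!?"\'()').lower() == node.lower():
            left = ' '.join(words[max(0, i - span):i])
            right = ' '.join(words[i + 1:i + 1 + span])
            lines.append((left, word, right))
    return lines


# A made-up sample sentence, used only for illustration
sample = ("The government takes the privacy of Canadians seriously and "
          "has notified the Privacy Commissioner of the matter.").split()

for left, node, right in kwic(sample, 'privacy', span=3):
    # Right-align the left context so the node words line up in a column
    print('{:>25}  {}  {}'.format(left, node, right))
```

Changing span widens or narrows the context shown on each side of the node.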

The purpose of generating a concordance output is to allow for manual, but controlled, examination of the word in question. As we will see in this exercise, it becomes very easy to recognize patterns of language use when the text is organized in this way. Further investigation can be conducted by sorting the list alphabetically on the word immediately to the left or to the right of the node word.

Generating a concordance output in Python is fairly simple thanks to the NLTK module. In this exercise we will generate a concordance output for one of our files.

Once again we will import our modules and definitions first. Here we see some new modules: NLTK, codecs, and sys.

NLTK stands for Natural Language Toolkit, which facilitates natural language processing in Python. NLTK has many functions that support electronic text analysis, including tokenizing, word frequency counters, and for the purposes of this demonstration, concordancers.

codecs is a module that helps Python read and write text in Unicode, a character encoding standard that covers accented letters, symbols and other characters beyond the basic alphabet and digits. We will not be removing the capitalization or punctuation in this exercise, so we are using codecs to avoid errors when reading and printing the file.

sys is a built-in Python module that allows for the manipulation of the Python runtime environment. Here we will use it to write the output of a program to a text file.



In [1]:

    
# This is where the modules are imported
import nltk
import sys
import codecs
from os import listdir
from os.path import splitext
from os.path import basename

# These functions extract the filename

def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name


def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name


def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name

# This function works on the contents of the file

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = codecs.open(filename, 'r', 'utf-8')
    contents = infile.read()
    infile.close()
    return contents
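On Python 3, the built-in open() accepts an encoding argument, so the same function could also be written without codecs. The sketch below is an alternative, not a replacement for the code above; the name read_file_builtin and the temporary sample file are assumptions for illustration.

```python
import os
import tempfile


def read_file_builtin(filename):
    "Read the contents of FILENAME as UTF-8 and return them as a string."
    # The with statement closes the file even if reading fails
    with open(filename, 'r', encoding='utf-8') as infile:
        return infile.read()


# Round-trip check on a temporary file containing non-ASCII characters
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write('Privacy, s\u00e9curit\u00e9')
    sample_text = read_file_builtin(path)

print(sample_text == 'Privacy, s\u00e9curit\u00e9')
```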

For this demonstration we will focus on only one file, the 2013 section of the corpus. As shown in the last exercise, Adding Context to Word Frequency Counts, the use of the word 'privacy' increased by about 40% between 2012 and 2013. Here we will take a closer look at 2013 in an attempt to identify any patterns of word use.

This is a case where cleaning the text may also destroy some of the context. While it is nice to have the numbers line up (in terms of word frequencies vs. number of concordance lines), removing the punctuation and capitalization makes the text harder to read and understand.



In [2]:

    
#this is the path to the file we want to read
file = '../Counting Word Frequencies/data/2013.txt'

#this calls on a definition from above: it stores the filename as a variable to use later
name = get_filename(file)
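As a quick check of what ends up in name, the two steps inside get_filename can be reproduced directly with os.path. This is only an illustration; the variable name stem is not part of the exercise.

```python
from os.path import basename, splitext

path = '../Counting Word Frequencies/data/2013.txt'

# Strip the extension first, then the directory, as get_filename does
stem = basename(splitext(path)[0])
print(stem)  # prints 2013
```

The value is reused later to build the name of the output file.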



In [3]:

    
#reads the file 
text = read_file(file)
#splits the text into a list of individual words
words = text.split()
#assigns NLTK functionality to the text
text = nltk.Text(words)

Here we will call the function, listing 25 lines from the text. Any other single word could be substituted here for privacy. It can only be one word though: the last piece of code split the text into single words, so a multi-word phrase will not produce any matches. More or fewer lines can be shown by changing the number beside lines=.
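The single-word restriction follows directly from how split() works. The short sketch below uses one sentence taken from the 2013 file to show that a two-word phrase never appears as a single item in the word list.

```python
sentence = "We have also advised the Privacy Commissioner of the situation."
tokens = sentence.split()

# Each token is a single whitespace-delimited word
single_found = 'Privacy' in tokens
phrase_found = 'Privacy Commissioner' in tokens

print(single_found)   # True
print(phrase_found)   # False: the phrase spans two separate tokens
```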



In [4]:

    
#concordance() prints its results itself and returns None, so it is not wrapped in print()
text.concordance('privacy', lines=25)



Displaying 25 of 918 matches:
imply unacceptable. That is why the Privacy Commissioner's office was notified.
 the matter to the attention of the Privacy Commissioner of Canada. I also aske
table. That is why we called in the Privacy Commissioner and called in the RCMP
able. That is why we brought in the Privacy Commissioner. That is why we brough
ese victims and when will they take privacy protection seriously? (1455) Hon. D
 is why we took steps to inform the Privacy Commissioner of Canada and to bring
ystems to make sure that Canadians' privacy is protected. That is why we have e
 happened. We have also advised the Privacy Commissioner of the situation. We h
. Speaker, the government takes the privacy of Canadians extremely seriously. T
ely unacceptable. The Office of the Privacy Commissioner has been notified and 
nment takes extremely seriously the privacy of Canadians and the loss by the de
ion. We will continue to do so. The privacy commissioner is investigating this.
ed before, the government takes the privacy of Canadians extremely seriously— S
mentioned, the government takes the privacy of Canadians extremely seriously. T
p greater chances for fraud. As the Privacy Commissioner now conducts her inves
g Bob Zimmer Access to Information, Privacy and Ethics Chair: Pierre-Luc Dussea
stioned Conservative legislation on privacy concerns, we were accused of standi
006. According to the Office of the Privacy Commissioner, this is one of the la
he Information Commissioner and the Privacy Commissioner. I will try to delinea
the Information Commissioner or the Privacy Commissioner. Each of them are offi
ve seen the value of an independent Privacy Commissioner working on behalf of a
a government agency. Canada's first Privacy Commissioner, Inger Hansen, was wit
ssion at first, and then, under the Privacy Act, became an independent officer 
tion Commissioner, Auditor General, Privacy Commissioner, we say the Parliament
neral, Information Commissioner and Privacy Commissioner are all examples of of

NLTK offers only limited processing of concordances. It is more useful to write the entire concordance to a text file, which can then be sorted and manipulated in many ways. The following code prints the entire concordance to a file. The 79 passed to text.concordance() is the width: the total number of characters in each concordance line, including all letters, punctuation and spaces. Passing sys.maxsize as the number of lines ensures that every match is written.



In [5]:

    
#creates a new file to receive the concordance output
fileconcord = codecs.open(name + '_collocates.txt', 'w', 'utf-8')
#keeps a reference to the original standard output so that we can restore it afterwards
tmpout = sys.stdout
#redirects everything that is printed to the file
sys.stdout = fileconcord
try:
    #generates and prints the whole concordance; width is the number of characters per line
    text.concordance("privacy", width=79, lines=sys.maxsize)
finally:
    #restores standard output, even if an error occurs, and then closes the file
    sys.stdout = tmpout
    fileconcord.close()
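The same redirection can also be written with contextlib.redirect_stdout from the standard library (Python 3.4 and later), which restores standard output automatically. The sketch below is only an illustration: noisy() is a hypothetical stand-in for text.concordance(), and an in-memory buffer takes the place of the output file so that the example runs on its own.

```python
import contextlib
import io


def noisy():
    # Stand-in for any function that prints its results, as text.concordance() does
    print('Displaying 1 of 1 matches:')


buffer = io.StringIO()
# Everything printed inside the with block goes to buffer instead of the screen
with contextlib.redirect_stdout(buffer):
    noisy()

captured = buffer.getvalue()
print(captured, end='')
```

An open file object could be passed to redirect_stdout in place of the buffer.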

Below is an example of a sorted concordance. This list is arranged alphabetically on the word to the right of the node.

As we can see from this list, and the list above, there are distinct patterns of word and phrase use. The unsorted list above contains many examples of "Office of the Privacy Commissioner", while the sorted list shows phrases like "privacy of Canadians" and "privacy protection". The manual examination of a concordance allows a researcher to understand how the node word is used in the context of the corpus, and it can help in the formation of hypotheses about the meaning of the word.
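A right-hand sort of this kind could be produced in several ways. The sketch below is one possibility in plain Python, not the method used to prepare the list discussed here; the function name sort_on_right is an assumption, and it treats the first occurrence of the node in each line as the node, which can be wrong when the word also appears in the left-hand context.

```python
import re


def sort_on_right(lines, node):
    """Sort concordance lines on the first word to the right of node.

    Lines are split at the first case-insensitive occurrence of node;
    lines that do not contain it are dropped. Both are assumptions.
    """
    pattern = re.compile(r'\b' + re.escape(node) + r'\b', re.IGNORECASE)
    keyed = []
    for line in lines:
        match = pattern.search(line)
        if match is None:
            continue
        right_words = line[match.end():].split()
        key = right_words[0].lower() if right_words else ''
        keyed.append((key, line))
    keyed.sort(key=lambda pair: pair[0])
    return [line for key, line in keyed]


# Four lines copied from the 25-line concordance shown earlier
sample_lines = [
    "imply unacceptable. That is why the Privacy Commissioner's office was notified.",
    "ese victims and when will they take privacy protection seriously? (1455) Hon. D",
    "ystems to make sure that Canadians' privacy is protected. That is why we have e",
    ". Speaker, the government takes the privacy of Canadians extremely seriously. T",
]

for line in sort_on_right(sample_lines, 'privacy'):
    print(line)
```

The same function could be applied to the lines of the saved concordance file, skipping the "Displaying ... matches:" header line.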

In the next exercise we will conduct a more robust statistical examination of the words that accompany the node word, which are known as 'collocates'. Concordance outputs provide a rich context for collocational analysis.