SUMMARY: This IPython Notebook shows you how to use corpkit to investigate a corpus of paragraphs containing the word risk in the NYT between 1963 and 2014.
First, let's import the functions we'll be using to investigate the corpus. These functions are designed for this interrogation, but also have more general use in mind, so you can likely use them on your own corpora.
Function name | Purpose
---|---
interrogator() | interrogate parsed corpora
dependencies() | interrogate parsed corpora for dependency info (presented later)
plotter() | visualise interrogator() results
table() | return plotter() results as table
quickview() | view interrogator() results
tally() | get total frequencies for interrogator() results
surgeon() | edit interrogator() results
merger() | merge interrogator() results
conc() | complex concordancing of subcorpora
keywords() | get keywords and ngrams from conc() output
collocates() | get collocates from conc() output
quicktree() | visually represent a parse tree
searchtree() | search a parse tree with a Tregex query
In [ ]:
import corpkit
from corpkit import (interrogator, plotter, table, quickview, tally, surgeon,
                     merger, conc, keywords, collocates, quicktree, searchtree)
# show visualisations inline:
%matplotlib inline
Next, let's set the path to our corpus. If you were using this interface for your own corpora, you would change this to the path to your data.
In [ ]:
# to unzip nyt files:
# gzip -dc data/nyt.tar.gz | tar -xf - -C data
# corpus with annual subcorpora
annual_trees = 'data/nyt/years'
Our main corpus comprises paragraphs from New York Times articles that contain a risk word, which we have defined by the regular expression r'(?i).?\brisk.?\b'. This includes low-risk or risk/reward as single tokens, but excludes brisk and asterisk.
The data comes from a number of sources.
In total, 149,504 documents were processed. The corpus from which the risk corpus was made is over 150 million words in length!
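As a quick sanity check of that regular expression, Python's re module can confirm which tokens count as risk words (a standalone sketch; the actual corpus-building code is not shown in this notebook):

```python
import re

# the pattern used above to define a risk word
risk_re = re.compile(r'(?i).?\brisk.?\b')

# these single tokens should all count as risk words
matches = [t for t in ['risk', 'risks', 'risky', 'low-risk', 'risk/reward']
           if risk_re.search(t)]

# these should not, despite containing the letters r-i-s-k
non_matches = [t for t in ['brisk', 'asterisk'] if risk_re.search(t)]
```

Here, matches ends up containing all five tokens, and non_matches stays empty.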
The texts have been parsed for part of speech and grammatical structure by *Stanford CoreNLP*. In this Notebook, we are only working with the parsed versions of the texts. We rely on Tregex to interrogate the corpora. Tregex allows very complex searching of parsed trees, in combination with Java Regular Expressions. It's definitely worthwhile to learn the Tregex syntax, but in case you're time-poor, at the end of this notebook is a series of Tregex queries that you can copy and paste into interrogator() and conc() queries.
So, let's start by generating some general information about this corpus. First, let's define a query to find every word in the corpus. Run the cell below to define the allwords_query
variable as the Tregex query to its right.
When writing Tregex queries or Regular Expressions, remember to always use r'...' quotes!
In [ ]:
# any token containing letters or numbers (i.e. no punctuation):
allwords_query = r'/[A-Za-z0-9]/ !< __'
Next, we perform interrogations with interrogator(). Its most important arguments are:

1. the path to the corpus
2. Tregex options
3. the Tregex query

We only need to count tokens, so we can use the -C option (it's often faster than getting lists of matching tokens). The cell below will run interrogator() over each annual subcorpus and count the number of matches for the query.
In [ ]:
allwords = interrogator(annual_trees, '-C', allwords_query)
When the interrogation has finished, we can view our results:
In [ ]:
# from the allwords results, print the totals
print allwords.totals
If you want to see the query and options that created the results, you can use:
In [ ]:
print allwords.query
Lists of years and totals are pretty dry. Luckily, we can use the plotter() function to visualise our results. At minimum, plotter() needs two arguments:

1. a title for the plot (a string)
2. the results to plot

There is also an argument for projecting the 1963 and 2014 results, which can be set to either True or False. By default, it is True, and from now on in this Notebook we'll leave it turned on. We'll try both options here:
In [ ]:
plotter('Word counts in each subcorpus', allwords.totals, projection = False)
plotter('Word counts in each subcorpus (projected)', allwords.totals, projection = True)
Great! So, we can see that the number of words per year varies quite a lot. That's worth keeping in mind.
Next, let's count the total number of risk words. Notice that we are using the -o flag, instead of the -C flag.
In [ ]:
# our query:
riskwords_query = r'__ < /(?i).?\brisk.?\b/' # any risk word and its word class/part of speech
# get all risk words and their tags:
riskwords = interrogator(annual_trees, '-o', riskwords_query)
Even when we do not use the -C flag, we can access the total number of matches as before:
In [ ]:
plotter('Risk words', riskwords.totals)
At the moment, it's hard to tell whether or not these counts are simply because our annual NYT samples are different sizes. To account for this, we can calculate the percentage of parsed words that are risk words. This means combining the two interrogations we have already performed.
We can do this by passing a third argument to plotter().
In [ ]:
plotter('Relative frequency of risk words', riskwords.totals,
fract_of = allwords.totals)
That's more helpful. We can now see some interesting peaks and troughs in the proportion of risk words. We can also see that 1963 contains the highest proportion of risk words. This is because the manual corrector of 1963 OCR entries preserved only the sentence containing risk words, rather than the paragraph.
It's often helpful to not plot 1963 results for this reason. To do this, we can add an argument to the plotter()
call:
In [ ]:
plotter('Relative frequency of risk words', riskwords.totals,
fract_of = allwords.totals, skip63 = True)
Perhaps we're interested in not only the frequency of risk words, but the frequency of different *kinds* of risk words. We actually already collected this data during our last interrogator() query.
We can print just the first three entries of the results list, rather than the totals list:
In [ ]:
for word in riskwords.results[:3]:
print word
# uncomment below to print the totals:
# print riskwords.totals
We now have enough data to do some serious plotting.
In [ ]:
plotter('Risk word / all risk words', riskwords.results,
fract_of = riskwords.totals)
plotter('Risk word / all words', riskwords.results,
fract_of = allwords.totals)
By default, plotter() plots the seven most frequent results, including 1963, and projects 1963 and 2014.

We can use other plotter() arguments to customise what our chart shows. plotter()'s possible arguments are:
plotter() argument | Mandatory/default? | Use | Type |
---|---|---|---|
title | mandatory | A title for your plot | string |
results | mandatory | the results you want to plot | interrogator() results |
fract_of | None | results for plotting relative frequencies/ratios etc. | list (interrogator('-C') form) |
num_to_plot | 7 | number of top results to display | integer |
sort_by | 'increase'/'decrease'/'static'/'total' | show results increasing or decreasing the most in frequency | string |
skip63 | False | do not plot 1963 | boolean |
proj63 | 4 | multiplier to project 1963 results and totals | integer |
multiplier | 100 | result * multiplier / total: use 1 for ratios | integer |
x_label | False | custom label for the x-axis | string |
y_label | False | custom label for the y-axis | string |
legend_totals | False | Print total/rel freq in legend | boolean |
legend_p | False | Print p value of increase/decrease slope in legend | boolean |
projection | True | Project 1963 and 2014 editions | boolean |
yearspan | False | plot a span of years | a list of two int years |
csvmake | False | filename for csv output | string |
save | False | save generated image (True = with title as name) | True/False/string |
You can easily use these to get different kinds of output. Try changing some parameters below:
In [ ]:
plotter('Relative frequencies of risk words', riskwords.results, fract_of = allwords.totals,
y_label = 'Percentage of all risk words', num_to_plot = 5,
skip63 = False, projection = True, proj63 = 5, csvmake = 'riskwords.csv', legend_totals = True)
If you just generated a csv file, you can quickly get the results with:
In [ ]:
!cat 'riskwords.csv' | head -n 7
# and to delete it:
#!rm 'riskwords.csv'
Use *yearspan* or *justyears* to specify years of interest:
In [ ]:
plotter('Relative frequencies of risk words', riskwords.results, fract_of = allwords.totals,
y_label = 'Percentage of all risk words', num_to_plot = 5, skip63 = False,
yearspan = [1963,1998])
Another way to change plotter() visualisations is by not passing certain results to plotter().
Each entry in the list of results is indexed: the top result is item 0, the second result is item 1, and so on.
So, you can skip the first 2 results by using [2:] after the results list:
In [ ]:
plotter('Relative frequencies of risk words', riskwords.results[2:], fract_of = allwords.totals,
y_label = 'Percentage of all risk words', num_to_plot = 5, skip63 = False, projection = True, proj63 = 5, legend_totals = True)
If you are after a specific set of indexed items, it's probably better to use surgeon()
(described below). For completeness, though, here's another way:
In [ ]:
indices_we_want = [32,30,40]
plotter('Relative frequencies of risk words', [ riskwords.results[i] for i in indices_we_want],
num_to_plot = 5, skip63 = True, projection = True, proj63 = 5)
Another neat thing you can do is save the results of an interrogation, so they don't have to be run the next time you load this notebook:
In [ ]:
# specify what to save, and a name for the file.
from corpkit import save_result, load_result
save_result(allwords, 'allwords')
You can then load these results:
In [ ]:
fromfile_allwords = load_result('allwords')
fromfile_allwords.totals
If you want to quickly table the results of a csv file, you can use table(). Its main argument is the path to the csv file, as a string. There are two optional arguments. First, you can set allresults to True to table all results, rather than just the plotted results. When this option is set to True, you may get far too many results. To cope with this, there is a maxresults argument, whose default value is 50. You can override this default to table more or fewer results.
In [ ]:
table('riskwords.csv')
In [ ]:
table('riskwords.csv', allresults = True, maxresults = 30)
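table() itself works on a csv file, so a rough stdlib approximation is easy to sketch (table_csv and its behaviour are assumptions for illustration, not corpkit's implementation):

```python
import csv

def table_csv(path, maxresults = 50):
    """Read a csv file and return at most maxresults rows,
    each as a list of cell values."""
    with open(path) as f:
        return list(csv.reader(f))[:maxresults]
```

Calling table_csv('riskwords.csv', maxresults = 30) would then mirror the cell above.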
quickview() is a function that quickly shows the n most frequent items in a list. Its arguments are:

1. an interrogator() result
2. n, the number of entries to show
In [ ]:
quickview(riskwords.results, n = 25)
The number shown next to the item is its index. You can use this number to refer to an entry when editing results.
tally() simply displays the total occurrences of results. Its first argument is the list you want tallies from. For its second argument, you can use:

* a list of indices
* 'all', to tally every entry
* a Regular Expression to match entry names
In [ ]:
tally(riskwords.results, [0, 5, 10])
In [ ]:
tally(riskwords.results[:10], 'all')
The Regular Expression option is useful for merging results (see below).
Results lists can be edited quickly with surgeon(). surgeon()'s arguments are:

1. an interrogator() results list
2. a list of indices, or a Regular Expression to match

By default, surgeon() keeps anything matching the regex, but this can be inverted with a remove = True argument. Because you are duplicating the original list, you don't have to worry about deleting interrogator() results.
In [ ]:
# low and high risks, using indices
lowhighrisks = surgeon(riskwords.results, [4, 9, 17]) # keep 4, 9 and 17
plotter('Low-, high- and higher- risk', lowhighrisks.results, num_to_plot = 3, skip63 = True)
# non-hyphenated words only:
nohyphenates = surgeon(riskwords.results, r'\b.*-.*\b', remove = True) # remove tokens with hyphens
quickview(nohyphenates.results)
plotter('Non-hyphenated risk words', nohyphenates.results, fract_of = riskwords.totals,
y_label = 'Percentage of all risk words', num_to_plot = 7, skip63 = True)
# only verbal risk words
verbalrisks = surgeon(riskwords.results, r'^\(v.*') #keep any token with tag starting with 'v'
plotter('Verbal risk words', verbalrisks, fract_of = allwords.totals,
y_label = 'Percentage of all words', num_to_plot = 6, skip63 = True)
Note the warning you'll receive if you specify an interrogation, rather than a results list.
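Conceptually, surgeon() is a regex (or index) filter over the results list. Assuming each entry has the form [name, (year, count), ...], its keep/remove logic might be sketched in plain Python like this (an illustration, not corpkit's actual code):

```python
import re

def surgeon_sketch(results, criteria, remove = False):
    """Keep entries whose name matches a regex or whose position is in
    a list of indices; with remove = True, drop them instead."""
    if isinstance(criteria, list):
        matched = set(criteria)
    else:
        matched = {i for i, entry in enumerate(results)
                   if re.search(criteria, entry[0])}
    if remove:
        return [e for i, e in enumerate(results) if i not in matched]
    return [e for i, e in enumerate(results) if i in matched]
```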
merger() is for merging items in a list. Like surgeon(), it duplicates the old list. Its arguments are:

1. an interrogator() results list
2. a list of indices to merge, or a Regular Expression to match
3. newname (optional): a name for the merged entry
In [ ]:
low_high_combined = merger(lowhighrisks.results, [0, 2], newname = 'high/higher risk')
plotter('Low and high risks', low_high_combined.results)
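The merging itself amounts to summing yearly counts. Assuming entries of the form [name, (year, count), ...], a plain-Python sketch of the idea (not corpkit's implementation) looks like:

```python
def merger_sketch(results, indices, newname = 'merged'):
    """Combine the entries at the given indices into a single entry,
    summing their counts year by year; keep other entries unchanged."""
    totals = {}
    for i in indices:
        for year, count in results[i][1:]:
            totals[year] = totals.get(year, 0) + count
    merged = [newname] + [(y, totals[y]) for y in sorted(totals)]
    rest = [e for i, e in enumerate(results) if i not in indices]
    return [merged] + rest
```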
It's important to note that the kind of results we generate are hackable. Using some straight Python combined with merger(), we can figure out how many unique risk words appear in the NYT each year.
To do this, we can take riskwords.results
, duplicate it, and change every count over 0 into 1.
In [ ]:
import copy # let's not modify our old list
all_ones = copy.deepcopy(riskwords.results)
for entry in all_ones:
for tup in entry[1:]:
if tup[1] > 0:
tup[1] = 1
We can then use merger() to merge every entry. This will tell us how many unique words there are each year.
In [ ]:
# this generates heaps of output, so let's clear it
from IPython.display import clear_output
mergedresults = merger(all_ones, r'.*', newname = 'Different risk words')
clear_output()
In [ ]:
# you could also use mergedresults.results[0]
plotter('Diversity of risk words', mergedresults.totals,
skip63 = True, y_label = 'Unique risk words')
So, we can see a generally upward trajectory, with more risk words constantly being used. Many of these results appear once, however, and many are nonwords. Can you figure out how to remove words that appear only once per year?
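As a hint for that exercise: assuming the [name, (year, count), ...] entry structure used above, one approach is to zero out any yearly count of exactly 1 before merging, so hapax words no longer contribute (a sketch of one possible solution, not the only one):

```python
def drop_hapax_years(results):
    """Return a copy of the results in which every yearly count of
    exactly 1 is replaced with 0."""
    return [[entry[0]] + [(year, 0 if count == 1 else count)
                          for year, count in entry[1:]]
            for entry in results]
```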
conc() produces concordances of a subcorpus based on a Tregex query. Its main arguments are:

1. the path to the subcorpus (a string)
2. the Tregex query
In [ ]:
# here, we use a subcorpus of politics articles,
# rather than the total annual editions.
lines = conc('data/nyt/trees/politics/1999', r'/JJ.?/ << /(?i).?\brisk.?\b/') # adj containing a risk word
You can set conc() to print n random concordances with the random = n parameter. You can also store the output to a variable for further searching.
In [ ]:
randoms = conc('data/nyt/trees/years/2007', r'/VB.?/ < /(?i).?\brisk.?\b/', random = 25)
conc() takes another argument, window, which alters the amount of co-text appearing either side of the match. The default is 50 characters.
In [ ]:
lines = conc('data/nyt/trees/health/2013', r'/VB.?/ << /(?i).?\brisk.?\b/', random = 25, window = 20)
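The window logic is simple string slicing. Here is a sketch of how a single concordance line might be assembled from raw text (conc_line is a hypothetical helper for illustration, not part of conc()):

```python
def conc_line(text, match, window = 50):
    """Return (left, match, right) for the first occurrence of match
    in text, with up to window characters of co-text on either side."""
    start = text.index(match)
    end = start + len(match)
    return text[max(0, start - window):start], match, text[end:end + window]
```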
conc() also allows you to view parse trees with its trees argument. By default, it is False:
In [ ]:
lines = conc('data/nyt/trees/years/2013', r'/VB.?/ < /(?i)\btrad.?/', trees = True)
The final conc() argument is csvmake = 'filename', which will produce a tab-separated spreadsheet with the results of your query. You can copy and paste this data into Excel.
In [ ]:
lines = conc('data/nyt/trees/years/2005', r'/JJ.?/ < /(?i).?\brisk.?/ > (NP <<# /(?i)invest.?/)',
window = 30, trees = False, csvmake = 'concordances.csv')
In [ ]:
! cat 'concordances.csv'
# and to delete it:
# ! rm 'concordances.csv'
There are also functions for keywording, ngramming and collocation. Each can take a number of kinds of input data:

* a path to a subcorpus
* conc() output (stored as a variable)
* a CSV file generated with conc()

keywords() produces both keywords and ngrams. It relies on code from the Spindle project.
In [ ]:
keys, ngrams = keywords('concordances.csv')
for key in keys[:10]:
print key
for ngram in ngrams:
print ngram
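The ngram half of this can be illustrated with nothing but the standard library: count adjacent word pairs and rank them (keywording proper also needs a reference corpus and a keyness statistic, which is what the Spindle code supplies):

```python
from collections import Counter

def top_bigrams(tokens, n = 10):
    """Count adjacent word pairs and return the n most common."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)
```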
You can even use interrogator()
to get keywords or ngrams from each subcorpus. To do this, use 'keywords' or 'ngrams' as the query argument.
In [ ]:
kwds = interrogator(annual_trees, '-t', 'keywords')
In [ ]:
plotter('Keywords', kwds.results)
With the collocates() function, you can specify the maximum distance at which two tokens will be considered collocates. The default is 5.
In [ ]:
colls = collocates('data/nyt/years/1989')
for coll in colls:
print coll
In [ ]:
colls = collocates('concordances.csv', window = 2)
for coll in colls:
print coll
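The distance criterion works roughly like this: a token is counted as a collocate when it appears within window positions of the node word (an illustrative sketch only; collocates() itself also ranks pairs statistically):

```python
from collections import Counter

def collocates_sketch(tokens, node, window = 5):
    """Count tokens occurring within `window` positions of each
    occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            nearby = tokens[max(0, i - window):i + window + 1]
            counts.update(t for t in nearby if t != node)
    return counts
```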
quicktree() and searchtree() are useful for visualising and searching individual syntax trees: for example, to practise your Tregex queries, or to make sure you are getting the expected result.
The easiest place to get a parse tree is from a CSV file generated using conc()
with trees set to True. Alternatively, you can open files in the data directory directly.
quicktree()
generates a visual representation of a parse tree. Here's one from 1989:
In [ ]:
tree = '(ROOT (S (NP (NN Pre-conviction) (NN attachment)) (VP (VBZ carries) (PP (IN with) (NP (PRP it))) (NP (NP (DT the) (JJ obvious) (NN risk)) (PP (IN of) (S (VP (VBG imposing) (NP (JJ drastic) (NN punishment)) (PP (IN before) (NP (NN conviction)))))))) (. .)))'
quicktree(tree)
searchtree()
requires a tree and a Tregex query. It will return a list of query matches.
In [ ]:
print searchtree(tree, r'/VB.?/ >># (VP $ NP)')
print searchtree(tree, r'NP')
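To get a feel for what tree searching involves without Tregex, here is a toy bracketed-tree scanner in plain Python that collects constituent labels (a stand-in for illustration only; it cannot express Tregex relations like >># or $):

```python
def parse_labels(tree):
    """Return the label of every bracketed node in a Penn-style
    tree string, in the order the nodes open."""
    labels, i = [], 0
    while i < len(tree):
        if tree[i] == '(':
            j = i + 1
            while j < len(tree) and tree[j] not in ' ()':
                j += 1
            if j > i + 1:
                labels.append(tree[i + 1:j])
            i = j
        else:
            i += 1
    return labels
```

For the 1989 tree above, parse_labels(tree).count('NP') would count its noun phrase nodes.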
Systemic functional linguistics argues that the main verb (the process) can be one of a few types.
In [ ]:
from dictionaries.process_types import processes
print processes.relational
print processes.verbal
We can use these in our Tregex queries to look for the kinds of processes participant risks are involved in. First, let's get a count for all processes with risk participants:
In [ ]:
query = r'/VB.?/ ># (VP < (NP <<# /(?i).?\brisk.?/))'  # all processes with a risk participant
proc_w_risk_part = interrogator(annual_trees, '-t', query, lemmatise = True)
query = r'/VB.?/ < /%s/ ># (VP < (NP <<# /(?i).?\brisk.?/))' % processes.relational
relationals = interrogator(annual_trees, '-t', query, lemmatise = True)
In [ ]:
plotter('Relational processes', relationals.results, fract_of = proc_w_risk_part.totals)
# You can also use the process types as regexes in merger() and surgeon() to merge/remove results entries.
Next, let's find the most common proper noun strings. interrogator()'s titlefilter option removes common titles, first names and determiners, making for more accurate counts. It is useful when the results being returned are groups/phrases, rather than single words.
In [ ]:
# Most common proper noun phrases
query = r'NP <# NNP >> (ROOT << /(?i).?\brisk.?\b/)'
propernouns = interrogator(annual_trees, '-t', query,
titlefilter = True)
In [ ]:
plotter('Most common proper noun phrases', propernouns.results, fract_of = propernouns.totals)
In [ ]:
quickview(propernouns.results, n = 200)
Notice that there are a few entries here that refer to the same group (f.d.a. and food and drug administration, for example). We can use merger() to fix these.
In [ ]:
# indices change after merger, remember, so
# make sure you quickview results after every merge.
merged_propernouns = merger(propernouns.results, [13, 20])
merged_propernouns = merger(merged_propernouns.results, [8, 32])
merged_propernouns = merger(merged_propernouns.results, [42, 107])
merged_propernouns = merger(merged_propernouns.results, [60, 111])
merged_propernouns = merger(merged_propernouns.results, [183, 197])
merged_propernouns = merger(merged_propernouns.results, [65, 127])
merged_propernouns = merger(merged_propernouns.results, [84, 149], newname = 149)
merged_propernouns = merger(merged_propernouns.results, [23, 130])
quickview(merged_propernouns.results, n = 200)
Now that we've merged some common results, we can use surgeon()
to build some basic thematic categories.
In [ ]:
# make some new thematic lists
people = surgeon(merged_propernouns.results, r'(?i)^(bush|clinton|obama|greenspan|gore|johnson|mccain|romney'
                 r'|kennedy|giuliani|reagan)$')
nations = surgeon(merged_propernouns.results, r'(?i)^(iraq|china|america|israel|russia|japan|france|germany|iran'
                  r'|britain|u\.s\.|afghanistan|australia|canada|spain|mexico|pakistan|soviet union|india)$')
geopol = surgeon(merged_propernouns.results, r'(?i)^(middle east|asia|europe|america|soviet union|european union)$')
#usplaces = surgeon(merged_propernouns.results, r'(?i)^(new york|washington|wall street|california|manhattan'
#                   r'|new york city|new jersey|north korea|italy|greece|bosnia|boston|los angeles|broadway|texas)$')
companies = surgeon(merged_propernouns.results, r'(?i)^(merck|avandia'
                    r'|citigroup|pfizer|bayer|enron|apple|microsoft|empire)$')
organisations = surgeon(merged_propernouns.results, r'(?i)^((white house|congress|federal reserve|nasa|pentagon)'
                        r'|f\.d\.a\.|c\.i\.a\.|f\.b\.i\.|e\.p\.a\.)$')
medical = surgeon(merged_propernouns.results, r'(?i)^(vioxx|aids|celebrex|f\.d\.a\.?)$')
# geopol[5][0] == u'e.u.'
In [ ]:
# plot some results
plotter('People', people.results, fract_of = propernouns.totals,
        y_label = 'Percentage of all proper noun groups', skip63 = True)
plotter('Nations', nations.results, fract_of = propernouns.totals,
        y_label = 'Percentage of all proper noun groups', skip63 = True)
plotter('Geopolitical entities', geopol.results, fract_of = propernouns.totals,
        y_label = 'Percentage of all proper noun groups', skip63 = False)
plotter('Companies', companies.results, fract_of = propernouns.totals,
        y_label = 'Percentage of all proper noun groups', skip63 = True)
plotter('Organisations', organisations.results, fract_of = propernouns.totals,
        y_label = 'Percentage of all proper noun groups', skip63 = True)
plotter('Medicine', medical.results, fract_of = propernouns.totals, num_to_plot = 4,
        y_label = 'Percentage of all proper noun groups', skip63 = True, save = True)
These charts reveal some interesting patterns.
In [ ]:
vioxx = surgeon(propernouns.results, r'(?i)^(vioxx|merck)$')
plotter('Merck and Vioxx', vioxx.results, fract_of = propernouns.totals, skip63 = True)
plotter('Merck and Vioxx', vioxx.results, fract_of = propernouns.totals, yearspan = [1998,2012])
Vioxx was removed from shelves following the discovery that it increased the risk of heart attack. Interestingly, although terrorism and war may come to mind when thinking of risk in the past 15 years, this health topic is easily more prominent in the data.
From here on out, it's up to you. Currently in development are resources for making parsed corpora, but really, that process only involves: