In this notebook, we build on a preliminary investigation of risk words in The New York Times, looking at five additional mainstream U.S. newspapers. See our GitHub repository, IPython Notebook, or project report for more information. The theoretical underpinnings of our work, for example, are outlined in detail in the report.
This Notebook assumes its reader has basic familiarity with Python and key linguistic concepts, such as word class, syntax, and lemma.
Sociologists such as Ulrich Beck and Anthony Giddens have characterised late modernity as a risk society, where risk plays an increasingly central role in both institutional structures and everyday life.
Functional linguists are interested in mapping the ways in which words and wordings come together to both construct and represent particular discourses and ideologies. Over the past fifty years, linguists working within functional traditions have mapped out in detail how language is employed by its users as a resource for negotiating interpersonal relationships and for representing doings and happenings in the world, or in consciousness.
Assuming that the increasing salience of risk in society is at least partly reflected in the ways in which risk is discussed, it should be possible to use real-world communication about risk to empirically examine sociological claims.
A number of technological developments now make it possible to investigate risk semantics on a large scale.
That said, the use of corpus linguistics for discourse analysis is a relatively recent development, and available tools and methods still lag somewhat behind the state-of-the-art resources available in computational linguistics, natural language processing, etc.
Accordingly, we use corpkit, a purpose-built Python module for interrogating parsed corpora. The tool is also available as a graphical application, documented and downloadable here. Though our investigation could be performed using the graphical interface, we have instead opted for the command-line tools, which offer more flexibility, especially when working with multiple corpora simultaneously.
Our main interest was in determining whether findings from our investigation of the NYT could be generalised to other major U.S. newspapers.
This investigation makes it possible to describe, and hopefully explain, how risk words have behaved longitudinally in mainstream U.S. newspapers. Later work will connect these findings more explicitly to sociological claims.
Based on readership, location and digital availability, we selected the following six newspapers:
1. The New York Times
2. The Wall Street Journal
3. The Tampa Bay Times
4. USA Today
5. Chicago Tribune
6. The Washington Post
To do this, we used ProQuest to grab any articles in these newspapers from 1987 to 2015 that contained a risk word (defined by the regular expression (?i)\brisk). This left us with over 500,000 articles!
Paragraphs with risk words were extracted and parsed using Stanford CoreNLP. Given the computationally intensive nature of parsing, we relied on high-performance computing resources from the University of Melbourne to run an embarrassingly parallel parsing script.
After a lot of parsing, we were left with six folders (one for each newspaper). Each folder contained annual subcorpora for 1987--2014. In these folders were XML files containing CoreNLP output annotation for each paragraph containing a risk word.
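Because each annual subcorpus can be parsed independently of the others, the work is "embarrassingly parallel". The idea can be sketched with the standard library (a toy illustration: `parse_folder` is a hypothetical stand-in for the real CoreNLP parsing job, and threads stand in for HPC worker processes):

```python
from concurrent.futures import ThreadPoolExecutor

def parse_folder(path):
    """hypothetical stand-in for the real CoreNLP parsing job"""
    return path, 'parsed'

# each annual subcorpus directory is an independent unit of work
folders = ['data/NYT/1987', 'data/NYT/1988', 'data/WSJ/1987']
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(parse_folder, folders))
print(results['data/WSJ/1987'])  # parsed
```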
Once we had the data annotated, we faced both computational and theoretical challenges.
Computationally, we used corpkit to search the constituency and dependency parses for lexicogrammatical and discourse-semantic sites of change. Theoretically, we used concepts from systemic functional linguistics, which articulates the ways in which levels of linguistic abstraction are related, and the ways in which different kinds of meaning are realised in grammar and wording.
There are a number of fairly serious limitations that should be acknowledged upfront.
First, we are looking only at mainstream U.S. newspaper articles. Our findings do not reflect society generally, in the USA or otherwise: risk words likely behave very differently in different text types.
Computationally, we must acknowledge that parser accuracy may be an issue. Luckily, we're working with a kind of data that generally parses well, given that default parser models are in fact trained on U.S. news journalism.
In terms of linguistic theory, relevant concepts from systemic functional linguistics cannot be operationalised fully, given differences between the systemic-functional grammar and the grammars with which texts were annotated.
Finally, we're not really investigating the concept of risk, but only risk words. Risk as a concept can be construed without the word being present ("They had to decide which was safer ... "). We chose to focus only on risk words because there is less room for ambiguity about whether or not a risk is being construed, and to reduce our dataset to a manageable size. The current dataset, including parses of only paragraphs containing a risk word, is over 44 gigabytes. To parse all text from six newspapers over a 30 year period would perhaps be the largest amount of data ever parsed for a single academic project, requiring massive amounts of time and dedicated computational resources.
In [1]:
# show plots in this notebook
%matplotlib inline
# import corpkit
from corpkit import interrogator, editor, plotter, conc
# some wordlists we'll use later
from dictionaries.process_types import processes
from dictionaries.wordlists import wordlists
from dictionaries.roles import roles
# for editing/combining results:
import pandas as pd
We'll also need to set paths to our corpora:
In [2]:
nyt = 'data/NYT'
wsj = 'data/WSJ'
wap = 'data/WAP'
cht = 'data/CHT'
ust = 'data/UST'
tbt = 'data/TBT'
all_corpora = [nyt, wsj, wap, cht, ust, tbt]
If you already have data saved, you might want to load it all into memory now. This saves a lot of time, as running some queries over the entire corpus can take hours.
In [88]:
from corpkit.other import load_result
allwords = load_result('6_allwords_newest')
riskwords = load_result('6_riskwords_newest')
riskclasses = load_result('6_riskclasses_newest')
risktags = load_result('6_risktags_newest')
govrole = load_result('6_gr_collapsed_newest')
funct = load_result('6_fnct_newest')
riskers = load_result('6_noun_riskers_newest')
noun_lemmata = load_result('6_noun_lemmata_newest')
noun_riskers = load_result('6_noun_riskers_newest')
atrisk = load_result('6_atrisk_newest')
risky = load_result('6_risky_newest')
# another way to do this creates a dictionary, but we want to
# avoid nested dictionaries so that things are easier
# to read:
#from corpkit import load_all_results
#r = load_all_results()
And finally, let's set our very simple regular expression for risk words:
In [4]:
# case insensitive matching a word boundary, followed by 'risk', then anything
# this allows at-risk, but disallows asterisk
riskword = r'(?i)\brisk'
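Before interrogating, we can sanity-check this pattern with Python's re module (a quick illustration, separate from the corpkit pipeline):

```python
import re

# the same pattern used throughout this notebook
riskword = r'(?i)\brisk'

# \b requires a word boundary before 'risk', so hyphenated forms match,
# but words that merely end in 'risk' (like 'asterisk') do not
print(bool(re.search(riskword, 'Risky business')))    # True
print(bool(re.search(riskword, 'the at-risk group'))) # True
print(bool(re.search(riskword, 'an asterisk')))       # False
```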
Ready? OK, let's interrogate.
corpkit is essentially four functions and some wordlists. Here are the functions and their main arguments:
interrogator(corpus, searchtype, query, **optional_args)
searches corpora and tabulates raw frequency results
editor(result_to_edit, operation, denominator, **optional_args)
edits these results, by merging, skipping, renaming, sorting, keywording, summing, etc.
plotter(title, results_to_plot, **optional_args)
allows us to visualise interrogator() and editor() results
conc(subcorpus, searchtype, query, **optional_args)
concordances corpora. It can help show results in more detail.
The wordlists can be accessed like this:
In [5]:
print processes.relational
In [6]:
print wordlists.determiners
In [7]:
print roles.process
Wordlists can be used as queries, or as criteria to match during editing.
So, the first thing we'll need to do is get some basic stuff:
The basic syntax for using interrogator() is to provide a corpus path (or list of paths), a search type, and a query.
When interrogator() gets a single string as its first argument, it treats the string as a path to a corpus, and outputs an object with query, results and totals attributes. When it receives a list of strings, it understands that there are multiple corpora to search. Using parallel processing, it searches each one, and returns a dict object with paths as keys and named tuple objects as values.
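These two return shapes can be mimicked with plain Python (a toy stand-in for illustration only; corpkit's actual objects carry more information):

```python
from collections import namedtuple

# a minimal stand-in for an interrogation result
Interrogation = namedtuple('Interrogation', ['query', 'results', 'totals'])

def fake_interrogator(path_or_paths):
    """mimic the single-path vs multiple-path return shapes"""
    if isinstance(path_or_paths, list):
        # multiple corpora: dict with paths as keys
        return {p: fake_interrogator(p) for p in path_or_paths}
    # single corpus: one named tuple
    return Interrogation(query='/(?i)\\brisk/', results={'risk': 1}, totals=1)

single = fake_interrogator('data/NYT')
multi = fake_interrogator(['data/NYT', 'data/WSJ'])
print(single.totals)         # 1
print(sorted(multi.keys()))  # ['data/NYT', 'data/WSJ']
```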
Note that our risk regular expression needs to be placed inside "/ /" boundaries, because here we're using Tregex syntax.
In [ ]:
# returns a named tuple with results, totals and query:
allwords = interrogator(all_corpora, 'count', 'any', quicksave = '6_allwords')
# returns a dict with paths as keys and named tuples as values:
riskwords = interrogator(all_corpora, 'words', '/%s/' % riskword, quicksave = '6_riskwords')
risktags = interrogator(all_corpora, 'pos', '__ < /%s/' % riskword, quicksave = '6_risktags')
# the lemmatise option turns words to their stem
# form, but turns pos tags to their major word class
riskclasses = interrogator(all_corpora, 'pos', '__ < /%s/' % riskword,
lemmatise = True, quicksave = '6_riskclasses')
We can now set some data display options, and then view an example result:
In [8]:
pd.options.display.max_rows = 30
pd.options.display.max_columns = 6
allwords.results
Out[8]:
It's then quite easy to visualise this data:
In [11]:
plotter('Word counts in each corpus', allwords.results)
So, the word counts vary between papers and across time, and we need to account for this variation whenever we compare frequencies. One way to do so is to convert the counts into relative frequencies:
In [13]:
rel = editor(allwords.results, '%', allwords.totals)
In [14]:
rel.results
Out[14]:
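The '%' operation performed by editor() above amounts to dividing each subcorpus's counts by that subcorpus's total and multiplying by 100. A toy sketch with plain dicts (hypothetical numbers, not corpkit itself):

```python
# hypothetical counts per subcorpus (year) and per-year word totals
results = {'1987': {'risk': 20, 'risky': 5},
           '1988': {'risk': 30, 'risky': 10}}
totals = {'1987': 1000, '1988': 2000}

# each cell becomes a percentage of its subcorpus total
rel = {year: {w: c * 100.0 / totals[year] for w, c in counts.items()}
       for year, counts in results.items()}
print(rel['1987']['risk'])   # 2.0
print(rel['1988']['risky'])  # 0.5
```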
Then, we can plot again:
In [15]:
plotter('Relative word counts in the subcorpora', rel.results)
In [399]:
# or, we could view this data cumulatively!
plotter('Cumulative number of words in the corpus', allwords.totals,
cumulative = True, figsize = (5,3))
print 'Total: %s words!' % "{:,}".format(allwords.totals.sum())
So, we have a fairly consistently sized dataset, with one notable caveat: we have little data from USA Today until 1993. That's worth bearing in mind. Generally, we'll use relative frequencies instead of absolute frequencies, in order to normalise our counts.
Our first interrogation, allwords, simply counted tokens. As such, it could return a single dataframe as output. The other searches return dictionaries, with corpus names as keys and results as values:
In [11]:
print type(riskwords)
print type(riskwords['UST'])
print type(riskwords['UST'].results)
riskwords['UST'].results
Out[11]:
Each dictionary entry also has a totals count:
In [12]:
riskwords['WSJ'].totals
Out[12]:
If we want to visualise these totals, we can write a simple helper function to concatenate them:
In [372]:
def get_totals(interrogation):
    """helper function: get totals from dict of interrogations"""
    lst = []
    # for each interrogation name and data
    for k, v in interrogation.items():
        # get the totals
        tot = v.totals
        # name the totals with the newspaper
        tot.name = k.upper()
        # add to a list
        lst.append(tot)
    # turn the list into a dataframe
    return pd.concat(lst, axis = 1)
In [13]:
rwt = get_totals(riskwords)
rwt
Out[13]:
We might now like to determine the percentage of all words that are risk words in each newspaper:
In [27]:
# get risk words in each year in each newspaper as a percentage of
# all words in that same year and newspaper
rel = editor(rwt, '%', allwords.results)
plotter('Relative frequency of risk words by publication', rel.results)
A good starting point is to find out whether there is any change in the most common part-of-speech (POS) tags for risk words.
In [395]:
risktags['CHT'].results[:5]
Out[395]:
It might be nice to look generally at all the data, without worrying about individual newspapers. We can write another function to collapse the distinction between each corpus:
In [ ]:
def collapsed(interrogation):
    import pandas as pd
    order = list(interrogation.values()[0].results.columns)
    df = interrogation.values()[0].results
    for i in interrogation.values()[1:]:
        df = df.add(i.results, fill_value = 0)
    return df[order]
In [310]:
# collapse newspapers
tags = collapsed(risktags)
# relativise
rel_tags = editor(tags, '%', 'self')
# separate plots for each data point
plotter('Most common POS tags for risk words', rel_tags.results,
subplots = True, num_to_plot = 8, figsize = (7, 7), layout = (4, 2))
There are some mixed signals here. Risk as an adjective seems to decrease, while risk as a comparative adjective increases.
In [376]:
classes = collapsed(riskclasses)
rel_classes = editor(classes, '%', 'self', print_info = False)
plotter('Most common word classes of risk words', rel_classes.results,
kind = 'area', figsize = (6,7), colours = 'copper')
plotter('Most common word classes of risk words', rel_classes.results,
subplots = True, num_to_plot = 4, figsize = (6, 3), layout = (2, 2))
The clearest trends are toward nominalisation and away from verbal risk. Let's look at how they behave longitudinally in each publication:
In [329]:
for cls in ['Noun', 'Verb']:
    print 'Relative frequency of %ss:' % cls.lower()
    in_each = entry_across_corpora(riskclasses, cls)
    rel = editor(in_each, '%', get_totals(riskwords), print_info = False)
    plotter('Relative frequency of %ss' % cls, rel.results, subplots = True, layout = (2,3), figsize = (8,4))
So, we can see that trends in risk word behaviour are often generalisable across publications. This is a good start.
The limitation of this kind of approach, however, is that word classes and POS tags are formal features. Though they correlate with semantics (as nouns are likely to be things, and verbs are more likely to be events), it is a fairly loose correlation: to run a risk features a nominal risk, but semantically, it is not really a thing.
In [ ]:
funct = interrogator(all_corpora, 'function', riskword, quicksave = '6_riskfunct')
In [332]:
coll_funct = collapsed(funct)
inc = editor(coll_funct, '%', 'self', sort_by = 'increase', keep_top = 10,
keep_stats = True, remove_above_p = True, print_info = False)
#dec = editor(coll_funct, '%', 'self', sort_by = 'decrease', keep_top = 10,
#keep_stats = True, remove_above_p = True, print_info = False)
In [333]:
inc.results
Out[333]:
Interestingly, there are only two functions in which risk is increasingly common: dobj (direct object) and amod (adjectival modifier). Looking to the right-hand columns, we can see the results that are decreasing most in frequency. These display a striking pattern: the top four results are each examples of risk as a process/predicator in either a main or embedded clause. This accords with the results from our pilot study, where risk in the NYT was seen to shift out of predicatorial roles.
In [358]:
fg = plotter('Functions of risk words undergoing longitudinal shifts', inc.results[[0,1,-4,-3,-2,-1]], tex = True,
num_to_plot = 6, subplots = True, figsize = (7,5), layout = (3,2), show_p_val = True)
Next, let's try grouping these into systemic-functional categories. We can access wordlists corresponding to systemic categories as follows:
In [14]:
print roles._asdict().keys()
The code below converts the functions to systemic labels using editor(). editor() can receive either a results attribute as its main input, or a dict object outputted by interrogator(). In the case of the latter, it outputs another dict object.
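The merging itself is simple to sketch: counts for each grammatical function are summed under a systemic label. Below, both the counts and the role groupings are illustrative toy data, not corpkit's actual roles wordlists:

```python
# hypothetical function counts for one subcorpus
counts = {'nsubj': 10, 'dobj': 15, 'root': 7, 'ccomp': 3, 'amod': 4}

# illustrative (not corpkit's real) systemic role groupings
merges = {'Participant': ['nsubj', 'dobj'],
          'Process': ['root', 'ccomp'],
          'Modifier': ['amod']}

# sum each group's counts under its systemic label
merged = {label: sum(counts.get(f, 0) for f in functs)
          for label, functs in merges.items()}
print(merged)  # {'Participant': 25, 'Process': 10, 'Modifier': 4}
```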
In [12]:
merges = {'Participant': roles.participant,
'Process': roles.process,
'Modifier': roles.circumstance + roles.epithet + roles.classifier}
sysfunc = editor(funct, merge_entries = merges, just_entries = merges.keys())
We can then plot a single newspaper by absolute or relative frequencies:
In [13]:
plotter('Systemic role of risk words in the WSJ', sysfunc['WSJ'].results)
rel_sysfunc = editor(sysfunc, '%', 'self', print_info = False)
plotter('Systemic role of risk words in the WSJ', rel_sysfunc['WSJ'].results)
Or, we can look at the behaviour of a given role in every paper. To do this, let's write a simple function that extracts an entry from each result and concatenates the output:
In [359]:
def entry_across_corpora(result_dict, entry_name, regex = False):
    """
    get one entry from each newspaper and make a new dataframe
    regex allows us to search by regular expression if need be
    """
    import pandas as pd
    import re
    res = []
    # for each corpus name and data
    for k, v in sorted(result_dict.items()):
        # grab the entry's result for each paper
        if not regex:
            try:
                column = v.results[entry_name]
            # skip newspapers without this entry
            except KeyError:
                continue
        else:
            column = v.results[[c for c in list(v.results.columns) if re.search(entry_name, c)]].iloc[:,0]
        # rename it to the corpus name
        column.name = k
        # append to a list
        res.append(column)
    # concatenate and return
    return pd.concat(res, axis = 1)
In [360]:
proc = entry_across_corpora(sysfunc, 'Process')
proc
Out[360]:
In [361]:
rel_proc = editor(proc, '%', riskwords)
plotter('Frequency of risk processes by newspaper', rel_proc.results, legend = 'or')
Well, that's rather hard to make sense of. A problem with this kind of analysis of risk as process, however, is that it misses risk processes where risk is a noun, not a verb:
1. They took a risk
2. They ran a risk
3. It posed a risk
4. They put it at risk
One of our search options, 'governor', can distinguish between these accurately. The query below shows us the function of risk words, and the lemma form of their governor.
In [ ]:
govrole = interrogator(all_corpora, 'g', riskword, lemmatise = True,
dep_type = 'collapsed-ccprocessed-dependencies', quicksave = '6_govrole')
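The governor search produces result names pairing the risk word's function with its governor's lemma, which is what lets us tell nominal risk processes apart. A toy sketch (the label shapes, e.g. 'dobj:take', are taken from the queries this notebook manipulates; the helper function is hypothetical):

```python
# 'function:governor_lemma' labels that mark risk processes
# where risk is grammatically a noun
nominal_risk_processes = {'dobj:run', 'dobj:take', 'dobj:pose', 'nmod:at:put'}

def is_risk_process(label):
    """True if a function:governor label marks a nominal risk process"""
    return label in nominal_risk_processes

print(is_risk_process('dobj:take'))    # True: 'they took a risk'
print(is_risk_process('dobj:reduce'))  # False: 'they reduced the risk'
```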
We can now fix up our earlier count of risk by functional role. It's tricky, but it shows the power of editor() and pandas:
In [234]:
# make a copy, to be safe
from copy import deepcopy
syscopy = deepcopy(sysfunc)
# for each corpus
for k, v in syscopy.items():
    # calculate number to add to process count
    are_proc = ['dobj:run', 'dobj:take', 'dobj:pose', 'nmod:at:put', 'prep_at:put']
    add_to_proc = govrole[k].results[[i for i in are_proc if i in govrole[k].results.columns]].sum(axis = 1)
    # calculate number to subtract from participant count
    subtract_from_part = govrole[k].results[['dobj:run', 'dobj:take', 'dobj:pose']].sum(axis = 1)
    # calculate number to subtract from modifier count
    submod = ['prep_at:put', 'nmod:at:put']
    subtract_from_mod = govrole[k].results[[i for i in submod if i in govrole[k].results.columns]].sum(axis = 1)
    # do these calculations
    v.results['Process'] = v.results['Process'] + add_to_proc
    v.results['Participant'] = v.results['Participant'] - subtract_from_part
    v.results['Modifier'] = v.results['Modifier'] - subtract_from_mod
In [235]:
print 'Uncorrected:'
print sysfunc['NYT'].results
print 'Corrected:'
print syscopy['NYT'].results
Now that the counts are corrected, we can plot the frequency of each functional role more accurately: first by role, and then by paper:
In [178]:
for role in ['Participant', 'Process', 'Modifier']:
    df = entry_across_corpora(syscopy, role)
    edi = editor(df, '%', get_totals(riskwords), print_info = False)
    plotter('Frequency of risk as %s by newspaper' % role, edi.results)
In [247]:
for name, data in syscopy.items():
    rel_data = editor(data.results, '%', 'self', print_info = False)
    plotter('Functional roles of risk words in the %s' % name, rel_data.results, figsize = (5, 4))
In [248]:
collapsed(syscopy)
Out[248]:
In [252]:
tot = get_totals(riskwords)
for entry in ['Participant', 'Process', 'Modifier']:
    en = entry_across_corpora(syscopy, entry)
    rel_en = editor(en, '%', tot, print_info = False)
    print entry
    plotter('X', rel_en.results, subplots = True, figsize = (6,4), layout = (2,3))
We can see that basic trends observed in the NYT, away from risk processes and toward risk participants and modifiers, hold true to some extent amongst other mainstream U.S. publications. This is especially so in the case of risk-as-modifiers, which are increasingly common in every publication sampled.
That said, the trends are not always as clear cut in other newspapers as they are in the NYT.
In [264]:
renames = {'to risk': 'root:root',
'to run risk':'dobj:run',
'to take risk': 'dobj:take',
'to pose risk': 'dobj:pose',
'to put at risk': 'nmod:at:put'}
# nyt was parsed with a slightly different grammar.
# this standardises 'put at risk'
govrole['NYT'].results.rename(columns={'prep_at:put': 'nmod:at:put'}, inplace=True)
risk_processes = editor(govrole, replace_names = renames,
just_entries = renames.keys(), sort_by = 'total')
Let's take a look at what we have:
In [265]:
print risk_processes['WAP'].results
Displaying all this information properly is tricky. First, we can try collapsing distinctions between subcorpora---though this means that we can't observe longitudinal change:
In [266]:
out = []
for k, v in risk_processes.items():
    data = v.results.sum(axis = 0)
    data.name = k
    out.append(data)
collapsed_years = pd.concat(out, axis = 1)
print collapsed_years
In [274]:
plotter('Risk processes: absolute frequency', collapsed_years, kind = 'bar', rot = False,
x_label = 'Publication', figsize = (8, 5))
rel_proc = editor(collapsed_years, '%', 'self', print_info = False)
plotter('Risk processes: relative frequency', rel_proc.results,
kind = 'bar', rot = False, x_label = 'Publication', figsize = (8, 5))
Or, we can collapse the distinction between newspapers. Perhaps we could make another function for this:
In [269]:
def collapsed(interrogation):
    import pandas as pd
    order = list(interrogation.values()[0].results.columns)
    df = interrogation.values()[0].results
    for i in interrogation.values()[1:]:
        df = df.add(i.results, fill_value = 0)
    df = df[order]
    return df
In [273]:
print collapsed(risk_processes)
ed = editor(collapsed(risk_processes), '%', allwords.totals, print_info = False)
plotter('Longitudinal behaviour of risk processes in U.S. print media', ed.results, figsize = (9, 7))
We can see whether this pattern is similar for each newspaper:
In [243]:
for name, data in govrole.items():
    res = editor(govrole[name].results, '%', syscopy[name].results['Process'], sort_by = 'name',
                 replace_names = renames, just_entries = renames.keys(), print_info = False)
    plotter('Risk processes in %s' % name, res.results,
            figsize = (6,4), legend_pos = 'outside right')
Let's look a little closer, and see whether the trend toward to put at risk and away from to run risk is reflected in every publication:
In [377]:
lst = []
renames = {#'to risk': 'root:root',
           'to run risk': 'dobj:run',
           #'to take risk': 'dobj:take',
           #'to pose risk': 'dobj:pose',
           'to put at risk': r'(nmod:at:put|prep_at:put)'}
for n, i in renames.items():
    ent = entry_across_corpora(govrole, i, regex = True)
    rel_ent = editor(ent, '%', entry_across_corpora(syscopy, 'Process'), print_info = False)
    print n
    plotter(n, rel_ent.results, subplots = True, layout = (2, 3), figsize = (6, 5), save = True)
Indeed, some patterns seem quite regular. Running risk decreases in every publication, and put at risk increases.
We also found growth in the use of risk as a nominal pre-head modifier (risk factor, risk brokerage, etc.). We can use the same interrogation to find out what risk modifies in this way:
In [38]:
# a bit of a hack: strip the 'nn:'/'compound:' prefix from names,
# then remove any remaining entry with a ':' in it
nom_mod = editor(govrole, '%', riskwords, replace_names = r'^(nn|compound):',
                 skip_entries = r':', use_df2_totals = True)
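The replace-then-skip hack can be sketched with re: strip the nominal-modifier prefix, then drop any entry that still contains a colon (toy entry names for illustration):

```python
import re

# hypothetical result names from the governor search
entries = ['nn:factor', 'compound:appetite', 'dobj:take', 'amod']

# strip the nominal-modifier prefix...
renamed = [re.sub(r'^(nn|compound):', '', e) for e in entries]
# ...then drop anything that still contains a ':'
kept = [e for e in renamed if ':' not in e]
print(kept)  # ['factor', 'appetite', 'amod']
```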
In [390]:
inc_class = editor(collapsed(nom_mod), sort_by = 'increase', keep_stats = True, print_info = False)
dec_class = editor(collapsed(nom_mod), sort_by = 'decrease', keep_stats = True, print_info = False)
plotter('Risk as classifier', collapsed(nom_mod))
plotter('Risk as classifier, increasing', inc_class.results, show_p_val = True)
plotter('Risk as classifier, decreasing', dec_class.results, show_p_val = True)
Risk group drops from extremely high prevalence due to its use during the beginning of the HIV/AIDS epidemic:
In [285]:
lines = conc('data/WAP/1987', 't', r'NP << /(?i)\brisk/ <<# /(?i)group/', print_output = False)
lines[['l', 'm', 'r']]
Out[285]:
In [396]:
for k, v in nom_mod.items():
    plotter('Nouns modified by risk as classifier (%s)' % k, v.results, figsize = (7, 4), legend_pos = 'outside right')
In [366]:
plotter('\emph{Risk appetite}', entry_across_corpora(nom_mod, 'appetite'), legend_pos = 'upper left')
In [295]:
lines = conc('data/WSJ/2009', 't', r'NP < (/NN.?/ < /(?i)\brisk/) < (/NN.?/ < /(?i)\bappetit/)', print_output=False)
lines[['l', 'm', 'r']]
Out[295]:
Risk appetite is a good example of the increasing number of ways in which risk is employed in the financial sector.
Still focussing on risk as a classifier, we can find out which words are increasing and decreasing the most over time:
In [ ]:
nom_mod_inc = editor(govrole, replace_names = r'(nn|compound):', skip_entries = ':',
sort_by = 'increase', print_info = False)
nom_mod_dec = editor(govrole, replace_names = r'(nn|compound):', skip_entries = ':',
sort_by = 'decrease', print_info = False)
In [298]:
nom_mod_inc['TBT'].results
Out[298]:
In [397]:
plotter('Nouns modified by \emph{risk} as classifier, increasing', collapsed(nom_mod_inc),
legend_pos = 'upper left', figsize = (9, 5), y_label = 'Absolute frequency')
One thing we noticed in our pilot investigation of the NYT was that while some adjectival risk words are declining in frequency (risky is a good example), others, like at-risk, are becoming more prominent. We can check the other newspapers to see if the trend is general:
In [ ]:
risky = interrogator(all_corpora, 'count', r'/(?i)\brisk(y|ier|iest)/', quicksave = '6_risky')
In [31]:
# the overall frequency of risk as modifier
mod = entry_across_corpora(sysfunc, 'Modifier')
mod.sum(axis = 1)
Out[31]:
In [32]:
#print risky.results.sum(axis = 1)
#rwt = get_totals(riskwords)
#print rwt
rel_risky_sum = editor(risky.results.sum(axis = 1), '%', mod.sum(axis = 1))
rel_risky_sum.results
Out[32]:
In [33]:
plotter('Risky/riskier/riskiest', rel_risky_sum.results)
As we found in the NYT, there is an overall decrease. Let's take a look at the individual newspapers:
In [38]:
res = []
for paper in list(risky.results.columns):
    ed = editor(risky.results[paper], '%', mod[paper], print_info = False)
    res.append(ed.results)
concatted = pd.concat(res, axis = 1)
plotter('Relative frequency of risky/riskier/riskiest', concatted)
In [ ]:
atrisk = interrogator(all_corpora, 'count', r'/(?i)\bat-risk/', quicksave = '6_atrisk')
In [48]:
atrisk.results
Out[48]:
In [52]:
rel_atrisk_sum = editor(atrisk.results, '%', collapsed(syscopy)['Modifier'])
plotter('Relative frequency of \emph{at-risk} modifier', rel_atrisk_sum.results)
Now, let's split the corpora again:
In [59]:
plotter('Relative frequency of \emph{at-risk}', rel_atrisk_sum.results, subplots = True, figsize = (10, 5), layout = (2,3))
Interestingly, both the peaks before 9/11 and the general rise are observable in most papers.
In our last investigation, we found that risk is increasingly occurring within complements and adjuncts, and less often within subject and finite/predicator positions. This was taken as evidence for decreasing arguability of risk in news discourse.
We can attempt to replicate that result using a previous interrogation.
In [ ]:
print roles
In [65]:
# we collapse the finite/predicator distinction because it's not very
# well handled by dependency parses. this is a pity, since finite plays
# a more important role in arguability than predicator.
merges = {'Subject': roles.subject,
'Finite/Predicator': roles.predicator,
'Complement': roles.complement,
'Adjunct': roles.adjunct}
moodrole = editor(funct, merge_entries = merges, just_entries = merges.keys())
In [368]:
rel_role = editor(collapsed(moodrole), '%', 'self', print_info = False)
#plotter('Mood role of risk words', rel_role.results)
plotter('Mood role of risk words', rel_role.results,
subplots = True, layout = (2,2), figsize = (7, 5))
This finding aligns with our pilot study, showing that risk is shifting from more arguable to less arguable positions within clauses.
The last thing we'll look at (for now) is the relationship between risking and power.
In our previous analysis, we found that powerful people are much more likely to do risking.
To determine this, we needed to make two search queries. The first finds the nominal heads when risk/run risk/take risk is the process:
In [ ]:
query = r'/NN.?/ !< /(?i).?\brisk.?/ >># (@NP $ (VP <+(VP) (VP ( <<# (/VB.?/ < /(?i).?\brisk.?/) | <<# (/VB.?/ < /(?i)(take|taking|takes|taken|took|run|running|runs|ran)/) < (NP <<# (/NN.?/ < /(?i).?\brisk.?/))))))'
noun_riskers = interrogator(all_corpora, 'words', query, lemmatise = True, quicksave = '6_noun_riskers', num_proc = 3)
Next, we need to get the frequencies of nouns generally, so that we can account for the fact that some nouns are very common.
In [ ]:
query = r'/NN.?/ >># NP !< /(?i).?\brisk.?/'
noun_lemmata = interrogator(all_corpora, 'words', query, lemmatise = True, quicksave = '6_noun_lemmata')
Now, we can combine the lists:
In [150]:
# entities of interest
people = ['man', 'woman', 'child', 'baby', 'politician',
          'senator', 'obama', 'clinton', 'bush']
# make summed versions of the noun_lemmata data
summed_n_lemmata = {}
for name, data in noun_lemmata.items():
    summed_n_lemmata[name] = data.results.sum(axis = 0)
# calculate the percentage of the time each word is in
# risker position for each paper
# newspaper names
cols = []
# the results go here
res = []
# for each newspaper
for name, data in noun_riskers.items():
    # collapse years
    data = data.results.sum(axis = 0)
    # make a new column
    ser = {}
    # for each risker
    for i in list(data.index):
        # skip hapax legomena
        if summed_n_lemmata[name][i] < 2:
            continue
        # get its percentage
        try:
            sm = data[i] * 100.0 / summed_n_lemmata[name][i]
        except (KeyError, ZeroDivisionError):
            continue
        # add this to the data for this paper
        ser[i] = sm
    # turn the data into a column
    as_series = pd.Series(ser)
    # sort it
    as_series.sort(ascending = False)
    # add it to a master list
    res.append(as_series)
    # add the newspaper name
    cols.append(name)
# put data together into spreadsheet
df = pd.concat(res, axis = 1)
# name newspapers again
df.columns = cols
# just named entries
df_sel = df.T[people].T
# show us what it looks like
print df_sel
# sort by frequency
sort_by = list(df_sel.sum(axis = 1).sort(ascending = False, inplace = False).index)
df_sel = df_sel.T[sort_by].T
# visualise
plotter('Risk and power', df_sel, kind = 'bar', figsize = (12, 5),
        x_label = 'Entity', y_label = 'Risker percentage')
plotter('Risk and power', df_sel.T, kind = 'bar', figsize = (12, 5), num_to_plot = 'all',
        x_label = 'Publication', y_label = 'Risker percentage')
We can see here that each paper construes powerful people as riskers. Also interesting is that entities favoured by the political orientation of the newspapers are more often positioned as riskers.
Finally, we can look longitudinally, to see if there are more or fewer riskers in general:
In [166]:
lemmata_tots = {}
for name, data in noun_lemmata.items():
    lemmata_tots[name] = data.totals
output = []
for name, data in noun_riskers.items():
    tot = data.totals
    tot = tot * 100.0 / lemmata_tots[name]
    tot.name = name
    output.append(tot)
df = pd.concat(output, axis = 1)
plotter('Percentage of noun lemmata in the risker position', df)
plotter('Percentage of noun lemmata in the risker position', df, kind = 'area', reverse_legend = False)
So, interestingly, there are fewer and fewer grammatical riskers in mainstream U.S. print news.