SUMMARY: This IPython Notebook demonstrates the methodology I used to extract findings from a corpus of posts to a bipolar disorder online support group. The tools and methods here can easily be applied to corpus linguistic and digital humanities research, and can also be used as an introduction to Python or natural language processing.
If you haven't already done so, the first things we need to do are install corpkit, download data for NLTK's tokeniser, and unzip our corpus.
In [ ]:
# install corpkit with either pip or easy_install
try:
    import pip
    pip.main(['install', 'corpkit'])
except ImportError:
    import easy_install
    easy_install.main(["-U", "corpkit"])
In [ ]:
# download nltk tokeniser data
import nltk
nltk.download('punkt')
In [ ]:
# unzip and untar our data
! gzip -dc data/bipolar.tar.gz | tar -xf - -C data
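If the shell command above isn't available (on Windows, for example), the same extraction can be done with Python's standard library:
In [ ]:
import tarfile
# extract the gzipped tarball into the data directory
with tarfile.open('data/bipolar.tar.gz', 'r:gz') as tar:
    tar.extractall('data')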
In [207]:
# a quick check that the notebook is executing code
print 'hello everyone'
Great! Now we have everything we need to start.
Let's first perform a quickstart, just so we know what we're getting into.
In [5]:
# plot our figures in this window
%matplotlib inline
# import everything we need
import corpkit
from corpkit import interroplot
# path to data
path = 'data/postcounts'
# words to find
query = ['meds', 'diagnosed', 'support', 'bipolar', 'pdoc', 'board']
# search and plot
interroplot(path, query)
On the x-axis are groups of posts. Group 1 comprises all first posts; Group 2 contains second and third posts. This progresses until Group 10, which contains the 560th posts and above. We'll go into this in more depth later.
Here, the y-axis measures each result as a percentage of all the results in that post group.
We can see how certain lexical items are used more often by new members than by veterans, and vice versa. Interestingly, bipolar itself drops in relative frequency quite a lot! By the end of this Notebook, we should have some reasons why this is so.
First, let's import the functions we'll be using to investigate the corpus. These functions are designed for this interrogation, but also have more general use in mind, so you can likely use them on your own corpora.
Function name | Purpose
---|---
`interrogator()` | interrogate parsed corpora
`editor()` | edit `interrogator()` results
`plotter()` | visualise `interrogator()` results
`quickview()` | view `interrogator()` results
`multiquery()` | run a list of `interrogator()` queries
`conc()` | complex concordancing of subcorpora
`keywords()` | get keywords and ngrams from `conc()` output, subcorpora
`collocates()` | get collocates from `conc()` output, subcorpora
`quicktree()` | visually represent a parse tree
`searchtree()` | search a parse tree with a Tregex query
`save_result()` | save a result to disk
`load_result()` | load a saved result
`load_all_results()` | load every saved result into a dict
In [206]:
%matplotlib inline
import corpkit
from corpkit import *
import pandas as pd
from IPython.display import HTML
# r = load_all_results()

def show(lines, index, show = 'thread'):
    # build an iframe pointing at the original forum thread for a concordance line
    url = lines.ix[index]['link'].replace('<a href=', '').replace('>link</a>', '')
    return HTML('<iframe src=%s width=1000 height=500></iframe>' % url)
The first thing we need to do is set a path to our corpus. If you have different data, you can change this.
In [2]:
corpus = 'data/postcounts' # path to corpora
Let's also quickly set some options for displaying raw data:
In [3]:
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 10)
pd.set_option('max_colwidth',70)
pd.set_option('display.width', 1000)
pd.set_option('expand_frame_repr', False)
The data is every post to a ten-year-old online support group for bipolar disorder. The support group contains mostly people who have bipolar disorder, but the friends and families of people living with the condition are also present.
Like many forums, this one is characterised by a high dropout rate, with most users having a total postcount under five.
In [195]:
from data.tallies import postcounter
num_posts = postcounter("data/postcounts")
In [199]:
plotter('Number of users by number of posts', num_posts, logx = True, x_label = "User's total post count",
y_label = 'Number of users', legend = False, kind = 'area', figsize = (16, 7))
To create the corpus, each page of the forum was downloaded and stored. Posts were extracted using Beautiful Soup, and were grouped into ten subcorpora representing different stages of membership. The first subcorpus contains all first posts. The second contains all 2nd and 3rd posts, and so forth. In the final subcorpus are all 560th posts and above. Each subcorpus is approximately equal in size.
Post group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Post range | 1st post | 2-3 posts | 4-7 posts | 8-15 posts | 16-30 posts | 31-58 posts | 59-115 posts | 116-219 posts | 220-559 posts | 560+ posts |
# texts | 5818 | 5689 | 5607 | 5937 | 5790 | 5875 | 5848 | 5757 | 5789 | 5570 |
# users | 5818 | 3348 | 1777 | 1004 | 529 | 284 | 148 | 76 | 38 | 8 |
Wordcount | 1135520 | 832198 | 827893 | 853123 | 828538 | 831341 | 803965 | 870124 | 755476 | 737896 |
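For illustration, the extraction step looked roughly like the sketch below. The file path and the `class_='post'` selector are hypothetical, since the real forum markup differs:
In [ ]:
from bs4 import BeautifulSoup
# a hypothetical saved page of the forum
with open('data/pages/page1.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
# pull the text of every post on the page
posts = [div.get_text() for div in soup.find_all('div', class_='post')]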
The texts have been parsed for part of speech and grammatical structure by Stanford CoreNLP. In this notebook, we are only working with the parsed versions of the texts. It's definitely worthwhile to learn the Tregex syntax; if you'd rather not, at the end of this notebook is a series of Tregex queries that you can copy and paste in.
The `interrogator()` and `conc()` functions rely on Tregex to interrogate the corpora. Tregex allows very complex searching of parsed trees, in combination with Java regular expressions.
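In a Tregex query, node labels (or /regular expressions/ over labels) are linked by relational operators such as `<` (has as a child) and `<<#` (is headed by). Here's a toy example, not used later in this notebook:
In [ ]:
# match any noun phrase whose head is a plural noun
example_query = r'NP <<# NNS'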
So, let's generate some general information about this corpus. First, let's define a query to find every word in the corpus. Run the cell below to define the `allwords_query` variable as the Tregex query to its right.
When writing Tregex queries or regular expressions, remember to always use `r'...'` quotes!
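To see why: without the `r` prefix, Python interprets backslash escapes before Tregex or the regex engine ever sees them:
In [ ]:
# the raw string keeps backslashes exactly as typed
print r'/VB.?/ < /\bhelp\b/'
# without r'', \b is interpreted as a backspace character
print '/VB.?/ < /\bhelp\b/'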
In [28]:
# any token containing letters or numbers (i.e. no punctuation):
allwords_query = r'/[A-Za-z0-9]/ !< __'
Corpus interrogation is handled by `interrogator()`. Its most important arguments are a path to the corpus, a search option, and a search query, as in `interrogator(corpus, 'c', allwords_query)` below.
There are many kinds of search options available:
Option | Function
---|---
`b` | get tag and word of Tregex match
`c` | count Tregex matches
`d` | get dependent of regular expression match and the relationship
`f` | get dependency function of regular expression match
`g` | get governor of regular expression match and the relationship
`i` | get dependency index of regular expression match
`k` | find keywords
`n` | find n-grams
`p` | get part-of-speech tag with Tregex
`r` | regular expression, for plaintext corpora
`s` | simple search string or list of strings for plaintext corpora
`w` | get word(s) returned by Tregex/keywords/ngrams
Right now, we only need to count tokens, so we can use the `c` option. The cell below will run `interrogator()` over each subcorpus and count the number of matches for the query.
In [69]:
allwords = interrogator(corpus, 'c', allwords_query, quicksave = 'allwords')
When the interrogation has finished, we can view our results:
In [29]:
# from the allwords results, print the totals
allwords.totals
Out[29]:
If you want to see the query and options that created the results, you can use:
In [30]:
allwords.query
Out[30]:
Lists of post groups and totals are pretty dry. Luckily, we can use the `plotter()` function to visualise our results, and Pandas syntax to table them. At minimum, `plotter()` needs two arguments: a title, and the results to plot.
In [4]:
plotter('Word counts in each subcorpus', allwords.totals, figsize = (16, 7), kind = 'area')
Great! So, we can see that the number of words per post group actually varies quite a lot. That's worth keeping in mind.
Let's get to know the functions available to us by performing a pilot study of the use of medication names in the community. If we simply pass `interrogator()` a list of words, it will automatically create a Tregex query to match any word in the list:
In [33]:
# our query:
med_words_query = ['lithium', 'lexapro', 'prozac', 'ssri', 'quetiapine', 'seroquel',
'depakote', 'risperdol', 'risperidone', 'depakene', 'lamictal',
'klonopin', 'abilify', 'geodon', 'topomax', 'zyprexa', 'tegretol',
'carbamazepine']
med_words = interrogator(corpus, 'words', med_words_query, quicksave = 'med_words')
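Under the hood, the list above is collapsed into a single Tregex query. The cell below sketches the general idea (an illustration only, not necessarily corpkit's exact output):
In [ ]:
# join the word list into one case-insensitive Tregex alternation
manual_query = r'/(?i)^(%s)$/ !< __' % '|'.join(med_words_query)
print manual_query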
Even when we do not use the count option, we can access the total number of matches as before:
In [6]:
plotter('Medication words', med_words.totals, figsize = (16, 7))
At the moment, it's hard to tell whether these counts simply reflect the different sizes of our post group samples. To account for this, we can calculate the percentage of parsed words that are medication words. This means combining the two interrogations we have already performed.
We can do this using `editor()`:
In [186]:
rel_medwords = editor(med_words.results, '%', allwords.totals)
plotter('Relative frequency of medication words', rel_medwords.totals, num_to_plot = 'all')
In [189]:
plotter('Relative frequency of medication words', rel_medwords.results, legend_pos = 'o r',
style = 'bmh', figsize = (14, 6))
That's more helpful. We can now see that first posts contain the highest relative frequency of medication words.
Perhaps we're interested not only in the relative frequency of medication words, but also in the frequency with which the different medications are mentioned. We actually already collected this data during our last `interrogator()` query.
We can print just the top five results in a number of different ways:
In [81]:
# 1: pandas syntax
rel_medwords.results.iloc[:,0:5]
Out[81]:
In [82]:
# 2: pandas syntax
rel_medwords.results[rel_medwords.results.columns[:5]]
Out[82]:
In [83]:
# 3: loop plus pandas syntax
for col in rel_medwords.results.columns[:5]:
print rel_medwords.results[col]
In [85]:
# 4: a corpkit function
quickview(rel_medwords, 5)
With all of this data, we can now do some serious plotting.
In [191]:
# normalise medication words by all medication words, and by all words
medword_props = editor(med_words.results, '%', med_words.totals)
perc_medwords = editor(med_words.results, '%', allwords.totals)
plotter('Medication word / all medication words', medword_props.results, figsize = (12, 6))
# selecting a slice using some pandas syntax
plotter('Medication word / all words', perc_medwords.results.iloc[0:5,5:10], figsize = (12, 6))
We can use `plotter()` arguments to customise what our chart shows. `plotter()`'s possible arguments are:
`plotter()` argument | Mandatory/default? | Use | Type
---|---|---|---
`title` | mandatory | a title for your plot | string
`results` | mandatory | the results you want to plot | `interrogator()` or `editor()` output
`num_to_plot` | 7 | number of top entries to show | int
`x_label` | False | custom label for the x-axis | str
`y_label` | False | custom label for the y-axis | str
`figsize` | (13, 6) | set the size of the figure | tuple: (length, width)
`tex` | `'try'` | use TeX to generate image text | boolean
`style` | `'ggplot'` | use Matplotlib styles | str: `'dark_background'`, `'bmh'`, `'grayscale'`, `'ggplot'`, `'fivethirtyeight'`
`legend_pos` | `'default'` | legend position | str: `'outside right'` to move legend outside chart
`show_totals` | `False` | print totals on legend or plot where possible | str: `'legend'`, `'plot'`, `'both'`, or `'False'`
`save` | `False` | save to file | `True`: save as `title`.png; str: save as str
`colours` | `'Paired'` | plot colours | str: any of Matplotlib's colormaps
`cumulative` | `False` | plot entries cumulatively | bool
`**kwargs` | False | pass other options to Pandas plot/Matplotlib | `rot = 45`, `subplots = True`, `fontsize = 16`, etc.
You can easily use these to get different kinds of output. Try changing some parameters below:
In [ ]:
plotter('Relative frequencies of medication words', rel_medwords.results, y_label = 'Percentage of all medication words', num_to_plot = 5, style = 'fivethirtyeight', legend_pos = 'lower left')
Another neat thing you can do is save the results of an interrogation, so they don't have to be run the next time you load this notebook:
In [ ]:
# specify what to save, and a name for the file.
save_result(allwords, 'allwords')
You can then load these results:
In [ ]:
fromfile_allwords = load_result('allwords')
fromfile_allwords.totals
You can also load every saved interrogation into a dictionary:
In [ ]:
r = load_all_results()
`quickview()` is a function that quickly shows the n most frequent items in a result. Its arguments are an `interrogator()` result and `n`, the number of entries to show:
In [203]:
quickview(med_words, n = 10)
For concordancing, there is `conc()`, which produces concordances of a subcorpus based on Tregex queries or lists of words. Its first arguments are a path to a subcorpus and a query; keyword arguments such as `n`, `random` and `window` control how many lines are shown and how:
In [204]:
lines = conc('data/postcounts/10', ['angry', 'mad', 'upset'], random = True, n = 20)
`conc()` automatically prints concordance lines alongside their index, so that you can easily manipulate them.
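Because the lines are indexed, you can slice or inspect individual hits with ordinary Pandas syntax (a small sketch, using the old `.ix` indexer that this notebook uses elsewhere):
In [ ]:
# inspect a single concordance line by its index
print lines.ix[5]
# or keep just the first ten lines
first_ten = lines[:10]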
In [ ]:
lines = conc('data/postcounts/4', r'/VB.?/ << /(?i).?\bhelp.*?\b/',
random = True, window = 50)
> **Tip:** You can concordance any `interrogator()` query, to make sure the expected things are being matched.
There are also functions for keywording, ngramming and collocation. Currently, these work with csv output from `conc()`. `keywords()` produces both keywords and ngrams. It relies on code from the Spindle project.
In [ ]:
keys, ngrams = keywords('data/postcounts/01', dictionary = 'bnc.p')
for key in keys[:10]:
    print key
for ngram in ngrams:
    print ngram
With the `collocates()` function, you can specify the maximum distance at which two tokens will be considered collocates.
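Its exact signature isn't shown here, so the cell below is only a sketch: it assumes `collocates()` takes a subcorpus path like `keywords()` does, plus a `window` keyword for the maximum distance.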
In [ ]:
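# find collocates within five tokens of one another in the first-posts subcorpus
# NOTE: assumed signature -- check corpkit's documentation for the exact arguments
colls = collocates('data/postcounts/01', window = 5)
for coll in colls[:10]:
    print coll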
Now you're familiar with the corpus and functions. Before beginning the corpus interrogation, let's also learn a bit about Systemic Functional Linguistics---the theory of language that underlies my analytical approach.
Functional linguistics is a research area concerned with how realised language (lexis and grammar) works to achieve meaningful social functions. One functional linguistic theory is Systemic Functional Linguistics, developed by Michael Halliday.
In [ ]:
from IPython.display import HTML
HTML('<iframe src=http://en.mobile.wikipedia.org/wiki/Michael_Halliday?useformat=mobile width=700 height=350></iframe>')
Central to the theory is a division between experiential meanings and interpersonal meanings.
Halliday argues that these two kinds of meaning are realised simultaneously through different parts of English grammar.
Here's one visualisation of it. We're concerned with the two left-hand columns. Each level is an abstraction of the one below it.
Transitivity choices include fitting together configurations of Participants, Processes and Circumstances, as in the following example:
I | can wear | the same outfit | all weekend |
---|---|---|---|
Participant | Process | Participant | Circumstance |
Mood features of a language include mood types (declarative, interrogative and imperative clauses) and modality.
Lexical density is usually a good indicator of the general tone of texts. The language of academia, for example, often has a very high ratio of nouns to verbs. We can approximate an academic tone simply by making nominally dense clauses:
The consideration of interest is the potential for a participant of a certain demographic to be in Group A or Group B.
Notice not only that there are many nouns (consideration, interest, potential, etc.), but that the verbs are very simple (is, to be).
In comparison, informal speech is characterised by smaller clauses, and thus more verbs.
A: Did you feel like dropping by?
B: I thought I did, but now I don't think I want to
Here, we have only a few simple nouns (you, I), with more expressive verbs (feel, dropping by, think, want).
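As a rough illustration, we can approximate lexical density by counting open-class words per token with NLTK. This is only a sketch: it needs NLTK's POS tagger data, and it crudely counts every verb, even simple ones like is, as lexical.
In [ ]:
import nltk
# requires NLTK's POS tagger data, e.g. nltk.download('averaged_perceptron_tagger')
sent = ('The consideration of interest is the potential for a participant '
        'of a certain demographic to be in Group A or Group B.')
tagged = nltk.pos_tag(nltk.word_tokenize(sent))
# treat nouns, verbs, adjectives and adverbs (tags N*, V*, J*, R*) as lexical
lexical = [word for word, tag in tagged if tag[0] in 'NVJR']
print 'Lexical density: %.2f' % (float(len(lexical)) / len(tagged))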
Like transitivity, mood is realised through the grammar of groups and phrases. Within the clause, mood constituents are the Subject, Finite, Predicator, Complement and Adjunct.
I | can | wear | the same outfit | all weekend |
---|---|---|---|---|
Subject | Finite | Predicator | Complement | Adjunct |
Note that these do not always coincide perfectly with transitivity annotations.
Note: SFL argues that through grammatical metaphor, one linguistic feature can stand in for another. Would you please shut the door? is an interrogative, but it functions as a command. invitation is a nominalisation of a process, invite. We don't have time to deal with these kinds of realisations, unfortunately.
A discourse analysis that is not based on grammar is not an analysis at all, but simply a running commentary on a text. - M.A.K. Halliday, 1994
Sorry: because of substantial revisions to `corpkit`, this part is still under construction.
In [39]:
query = ([u'Declarative', r'S < (NP $++ VP)'],
[u'Interrogative', r'ROOT !<<, (MD < __ ) ( < /(SBARQ|SINV|SQ)/ | << (/\?/ !< __))'],
[u'Imperative', r'VP !<1 (__ < /(?i)\b(thank|hello|hey|hi|am|had)\b/) !<1 /(?i)\bto\b/ !<<# /\b(VBG|VBN|VBZ|VBD)\b/ >1 (S !>> VP > (ROOT !<< /\?/))'],
[u'Modalised interrogative', r'ROOT <<, (MD < __ ) ( < /(SBARQ|SINV|SQ)/ | << (/\?/ !< __))'])
moods = multiquery(corpus, query)
In [118]:
clauses = r['clauses']
rel_moods = editor(moods.results, '%', moods.totals)
plotter('Moods', rel_moods.results, figsize = (12, 10), style = 'fivethirtyeight', subplots = True)
In [35]:
modals_unlemmatised = interrogator(corpus, 'words', 'modals', lemmatise = False)
In [27]:
modals = interrogator(corpus, 'words', 'modals', lemmatise = True)
In [36]:
rel_modals_unlemmatised = editor(modals_unlemmatised.results, '%', modals_unlemmatised.totals)
rel_modals2_unlemmatised = editor(modals_unlemmatised.results, '%', allwords.totals)
rel_modals = editor(modals.results, '%', modals.totals)
rel_modals2 = editor(modals.results, '%', allwords.totals)
In [202]:
#plotter('Modalisation over the course of membership (unlemmatised)', rel_modals_unlemmatised.results,
# figsize = (14, 6), style = 'bmh')
plotter('Modalisation over the course of membership (unlemmatised)', rel_modals2_unlemmatised.results,
figsize = (14, 6), style = 'bmh')
#plotter('Modalisation over the course of membership (lemmatised)', rel_modals.results,
# figsize = (14, 6), style = 'bmh')
plotter('Modalisation over the course of membership (lemmatised)', rel_modals2.results,
figsize = (14, 6), style = 'bmh')
In [18]:
i_would_adjunct = r'MD < /(?i)would/ > (VP << RB $ (NP <<# /(?i)^i$/))'
In [20]:
lines = conc('data/postcounts/01', i_would_adjunct, n = 30, random = True, window = 50)
In [24]:
from dictionaries.process_types import processes
# i_would_adjunct = r'MD < /(?i)would/ > (VP << RB $ (NP <<# /(?i)^i$/))'
i_would_adjunct_mental = r'MD < /(?i)would/ > (VP ( <+(VP) (VP < (/VB.?/ < /%s/) !< VP)) << RB $ (NP <<# /(?i)^i$/))' % processes.mental
In [25]:
lines = conc('data/postcounts/01', i_would_adjunct_mental, n = 30, random = True, window = 50)
In [121]:
from dictionaries.process_types import processes
# i_would_adjunct = r'MD < /(?i)would/ > (VP << RB $ (NP <<# /(?i)^i$/))'
i_would_not_mental = r'MD < /(?i)would/ > (VP ( <+(VP) (VP < (/VB.?/ !< /%s/ !< /%s/) !< VP)) << RB $ (NP <<# /(?i)^i$/))' % (processes.verbal, processes.mental)
In [122]:
lines = conc('data/postcounts/01', i_would_not_mental, n = 30, random = True, window = 50)
In [1]:
from corpkit import *
%matplotlib inline
govs = load_result('bp_govrole_lem_collcc')
In [2]:
govs.query
Out[2]:
In [127]:
govs.results
Out[127]:
In [6]:
#govs = interrogator(corpus, 'g', bp_words, lemmatise = True)
from dictionaries.process_types import processes
%matplotlib inline
renames = [(r'root:root', 'be bipolar'),
(r'(dobj|acomp):have', 'have bipolar'),
(r'dobj:.*', 'other processes')]
be_have = editor(govs.results, '%', 'self', sort_by = 'total',
replace_names = renames, just_entries = [n for r, n in renames])
plotter('Being and having bipolar', be_have.results, figsize = (10, 5), save = True, style = 'fivethirtyeight',
y_label = 'Percentage of all processes')
In [133]:
be_have = editor(govs.results, merge_entries = ['dobj:have', 'acomp:have'], newname = 'have bipolar')
be_have = editor(be_have.results, merge_entries = ['root:root'], newname = 'be bipolar')
be_have = editor(be_have.results, merge_entries = r'dobj:%s' % processes.relational, newname = 'other relational processes')
be_have = editor(be_have.results, '%', be_have.totals, sort_by = 'total',
just_entries = ['have bipolar', 'be bipolar', 'other relational processes'])
plotter('Being and having bipolar', be_have.results, num_to_plot = 3, y_label = 'Percentage of all relational processes',
style = 'fivethirtyeight', figsize = (14, 6))
In [45]:
part_query = r'/(NN|JJ).?/ >># (/(NP|ADJP)/ $ VP | > VP)'
parts = interrogator(corpus, 'words', part_query, lemmatise = True)
In [46]:
quickview(parts, 50)
Plotting by total:
In [114]:
tot_part = editor(parts.results, '%', parts.totals, sort_by = 'total')
plotter('Participants, by total', tot_part.results, figsize = (10, 5),
        style = 'fivethirtyeight', num_to_plot = 20, interactive = True)
Out[114]:
Plotting by decreasing frequency:
In [63]:
dec_part = editor(parts.results, '%', parts.totals, sort_by = 'decrease')
plotter('Participants, decreasing', dec_part.results, figsize = (16, 7), num_to_plot = 10,
style = 'fivethirtyeight')
Plotting by increasing frequency:
In [58]:
inc_part = editor(parts.results, '%', parts.totals, sort_by = 'increase')
plotter('Participants, increasing', inc_part.results, figsize = (16, 7), num_to_plot = 10,
style = 'fivethirtyeight', kind = 'area')
First, we copy our results a few times. The original cell is missing here, so the cell below is a likely reconstruction:
In [156]:
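# a likely reconstruction of the missing cell: duplicate the participants
# interrogation so that jargon and non-jargon entries can be merged independently
import copy
jargon = copy.deepcopy(parts)
non_jargon = copy.deepcopy(parts)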
We then make dictionaries of jargon/non-jargon, and their search queries:
In [157]:
jargon_terms = {'pdoc': r'^pdoc',
'tdoc': r'^tdoc',
'meds': r'^med$'}
non_jargon_terms = {'psychologist/psychiatrist': r'^(psychologist|psychiatrist)',
'diagnos*': r'^diagnos',
'therapist': r'^therapist',
'medication/medicine': r'^(medicine|medication)s*',
'doc/doctor': r'^doc(tor)*s*$'}
Now we iterate through these items, merging entries in the copied lists:
In [158]:
for name, regex in jargon_terms.items():
    jargon = editor(jargon.results, merge_entries = regex, newname = name)
jargon = editor(jargon.results, '%', parts.totals, just_entries = jargon_terms.keys())
for name, regex in non_jargon_terms.items():
    non_jargon = editor(non_jargon.results, merge_entries = regex, newname = name)
non_jargon = editor(non_jargon.results, '%', parts.totals, just_entries = non_jargon_terms.keys())
In [159]:
plotter('Jargon terms', jargon.results, subplots = True, style = 'fivethirtyeight', figsize = (9, 9))
plotter('Non-jargon terms', non_jargon.results, subplots = True, style = 'fivethirtyeight', figsize = (9, 9))
In [184]:
parts = r['participants']
proc = r['processes']
allwords = r['allwords']
# use a list of tuples rather than a dict, so that the order of entries is stable
nms = [('diagnose as participant', r'^diagno'),
       ('dx as participant', r'^dx'),
       ('diagnose as process', r'^diagno'),
       ('dx as process', r'^dx')]
diag = []
for name, regex in nms[:2]:
    tmp = editor(parts.results, merge_entries = regex, newname = name)
    diag.append(tmp.results[name])
for name, regex in nms[2:]:
    tmp = editor(proc.results, merge_entries = regex, newname = name)
    diag.append(tmp.results[name])
diag = pd.concat(diag, axis = 1)
diagnosis = editor(diag, '%', allwords.totals, just_entries = [name for name, regex in nms])
plotter('Diagnosis as jargon process', diagnosis.results, subplots = True,
figsize = (10, 10))
In [75]:
what_i_do = r['what_i_do_deps']
what_i_do.query
Out[75]:
In [78]:
tot_proc = editor(what_i_do.results, '%', what_i_do.totals, sort_by = 'total')
plotter('Processes with first person subject as actor', tot_proc.results, num_to_plot = 10,
style = 'bmh', figsize = (16, 7))
In [87]:
inc_proc = editor(what_i_do.results, '%', what_i_do.totals, sort_by = 'increase')
plotter('Processes with first person subject as actor', inc_proc.results, num_to_plot = 10,
style = 'bmh', figsize = (16, 7), legend_pos = 'upper left')
In [77]:
dec_proc = editor(what_i_do.results, '%', what_i_do.totals, sort_by = 'decrease')
plotter('Processes with first person subject as actor, decreasing', dec_proc.results, num_to_plot = 10,
style = 'bmh', figsize = (16, 7))
We can also get results with the least slope:
In [88]:
stat = editor(what_i_do.results, keep_top = 100, print_info = False)
stat_proc = editor(stat.results, '%', what_i_do.totals, sort_by = 'static')
plotter('Processes with first person subject as actor, least slope', stat_proc.results, num_to_plot = 10,
        style = 'bmh', figsize = (16, 7))
In [69]:
what_you_do = r['what_you_do_deps']
what_you_do.query
Out[69]:
In [74]:
inc_proc_2 = editor(what_you_do.results, '%', what_you_do.totals, sort_by = 'increase')
plotter('Processes with second person subject as actor, increasing', inc_proc_2.results, num_to_plot = 10,
        style = 'bmh', figsize = (16, 7))
In [73]:
dec_proc_2 = editor(what_you_do.results, '%', what_you_do.totals, sort_by = 'decrease')
plotter('Processes with second person subject as actor, decreasing', dec_proc_2.results, num_to_plot = 10,
        style = 'bmh', figsize = (16, 7))
In [96]:
from dictionaries.process_types import processes
proc_types_query = [('Relational processes', r'/VB.?/ < /%s/ >># ( VP >+(VP) (VP !> VP $ NP))' % processes.relational),
('Mental processes', r'/VB.?/ < /%s/ >># ( VP >+(VP) (VP !> VP $ NP))' % processes.mental),
('Verbal processes', r'/VB.?/ < /%s/ >># ( VP >+(VP) (VP !> VP $ NP))' % processes.verbal)]
proc_types = multiquery(corpus, proc_types_query)
In [193]:
proc_total = r['processes']
rel_proc_types = editor(proc_types.results, '%', proc_total.totals)
plotter('Process types', rel_proc_types.results, kind = 'line', figsize = (11, 6),
style = 'fivethirtyeight', interactive = True)
Out[193]:
In [24]:
#part_in_diag = interrogator(corpus, 'd', query = r'(?i)\bdiagno',
#lemmatise = True, dep_type = 'collapsed-ccprocessed-dependencies')
part_in_diag = load_result('diagnose_coll_deps')
# get only roles that equate to participant and process
to_match = r'^(dobj|nsubj|nsubjpass|csubj|acomp|iobj|csubjpass|tmod|agent|advmod|prep_[a-z]*):'
parts_circs = editor(part_in_diag.results, just_entries = to_match)
# edit names to sfl categories
# normalise bipolar and dr words
roles = [(r':(bp|bipolar|ii|bi-polar).*', ':bipolar'),
(r':(doc|pdoc|tdoc|psychiatrist|dr).*', ':doctor'),
(r':(you|yourself|yourselves).*', ':you'),
(r'^(nsubj|agent):', 'actor:'),
(r'^(nsubjpass|dobj):', 'goal:'),
(r'^prep_by:', 'actor:'),
(r'(prep_[a-z]*|acomp):', 'range:'),
(r'^tmod:', 'circ:'),
(r'^advmod:', 'circ:'),
(r'goal:bipolar', 'range:bipolar')]
#(r'goal:(?!(i|you)).*', 'goal:3rdperson'),
#(r'actor:(?!(i|you)).*', 'actor:3rdperson')]
parts_circs = editor(parts_circs.results, replace_names = roles,
merge_subcorpora = ['06', '07', '08', '09', '10'], new_subcorpus_name = '6+')
In [26]:
p = editor(parts_circs.results, '%', 'self', just_entries = r'(actor|goal|range):',
sort_by = 'total', just_subcorpora = ['01', '6+'], keep_stats = False, keep_top = 12)
p.results.rename(index={'01': 'First posts', '6+': 'Veteran posts (groups 6$+$)'},
inplace = True)
p = editor(p.results, replace_names = [(':', r'\\textbf{: '), (r'$', r'}')]) # $
plotter('Participants selected by \emph{diagnose} as Event', p.results.T.head(7),
kind = 'bar', y_label = 'Percentage of all \emph{diagnose} circumstances',
x_label = 'Word', figsize = (10, 5), rot = 0, save = True)
In [15]:
# get relative frequencies of circumstances
c = editor(parts_circs.results, just_entries = r'^circ:',
sort_by = 'total', just_subcorpora = ['01', '6+'], keep_stats = False)
c = editor(c.results, '%', c.totals, replace_names = r'^circ:')
plotter(r'Common circumstances of \emph{diagnose} processes in first posts',
c.results.ix['01'].order(ascending=False), kind = 'bar', num_to_plot = 13,
figsize = (10, 5), show_totals = 'plot', x_label = 'Participant', save = True,
y_label = 'Percentage of all circ. in \emph{diagnose} process')
# a new chart, comparing first posts to veteran posts
key_terms = ['recently', 'just', 'ago', 'now', 'year', 'week',
'properly', 'correctly', 'officially', 'accurately', 'here']
selected_c = editor(c.results, just_entries = key_terms,
just_subcorpora = ['01', '6+'], sort_by = 'decrease')
selected_c.results.rename(index={'01': 'First posts', '6+': 'Veteran posts (6$+$)'}, inplace = True)
plotter('Circumstances surrounding the process of diagnosis', selected_c.results.T, num_to_plot = 'all',
kind = 'bar', y_label = 'Percentage of all \emph{diagnose} circumstances', x_label = 'Word',
figsize = (10, 5), save = True)
In [16]:
c.results
Out[16]:
In [17]:
#part_in_diag = interrogator(corpus, 'd', query = r'(?i)\bdiagno',
#lemmatise = True, dep_type = 'collapsed-ccprocessed-dependencies')
part_in_diag = load_result('diagnose_coll_deps')
# get only roles that equate to participant and process
to_match = r'^(dobj|nsubj|nsubjpass|csubj|acomp|iobj|csubjpass|tmod|agent|advmod|prep_[a-z]*):'
parts_circs = editor(part_in_diag.results, just_entries = to_match)
# edit names to sfl categories
# normalise bipolar and dr words
roles = [(r':(bp|bipolar|ii|bi-polar).*', ':bipolar'),
(r':(doc|pdoc|tdoc|psychiatrist|dr).*', ':doctor'),
(r':(you|yourself|yourselves).*', ':you'),
(r'^(nsubj|agent):', 'actor:'),
(r'^(nsubjpass|dobj):', 'goal:'),
(r'^prep_by:', 'actor:'),
(r'(prep_[a-z]*|acomp):', 'range:'),
(r'^tmod:', 'circ:'),
(r'^advmod:', 'circ:'),
(r'goal:bipolar', 'range:bipolar')]
#(r'goal:(?!(i|you)).*', 'goal:3rdperson'),
#(r'actor:(?!(i|you)).*', 'actor:3rdperson')]
parts_circs = editor(parts_circs.results, replace_names = roles)
In [18]:
# get relative frequencies of circumstances
c = editor(parts_circs.results, just_entries = r'^circ:',
sort_by = 'total', keep_stats = False)
c = editor(c.results, '%', c.totals, replace_names = r'^circ:')
# a new chart, comparing first posts to veteran posts
inc_c = editor(c.results, sort_by = 'increase')
In [21]:
list(inc_c.results.columns)[:30]
Out[21]:
key_terms = ['recently', 'just', 'ago', 'now', 'year', 'week',
'properly', 'correctly', 'officially', 'accurately', 'here']
selected_c = editor(c.results, just_entries = key_terms,
just_subcorpora = ['01', '6+'], sort_by = 'decrease')
selected_c.results.rename(index={'01': 'First posts', '6+': 'Veteran posts (6$+$)'}, inplace = True)
plotter('Circumstances surrounding the process of diagnosis', selected_c.results.T, num_to_plot = 'all',
kind = 'bar', y_label = 'Percentage of all \emph{diagnose} circumstances', x_label = 'Word',
figsize = (14, 7))
In [29]:
#proc = load_result('tree_processeslem')
#query = r'/VB.?/ >># ( VP >+(VP) (VP !> VP $ NP))'
#proc = interrogator(corpus, 'words', query, lemmatise = True)
p = editor(proc.results.ix[0], 'k', proc.results)
p.results
Out[29]:
In [12]:
lines = conc('data/postcounts/10', r'/crazy/', window = 50, n = 50, add_links = True)
In [14]:
lines.ix[0]
Out[14]:
In [18]:
show(lines, 0, show = 'thread')
Out[18]: