The larger query examined the presence and rate of occurrence of certain disease names in the corpus. Here we want to 1) look at this data, 2) graph it per year, and eventually 3) compare these diseases. We will also need to 4) normalise the data to account for increased publishing activity over time.
In [2]:
import yaml
import matplotlib.pyplot as plt
disease = 'consumption'
filename = 'diseases_20150616_1139_38.yml'
To look at a different disease, change the search term and the result file here by commenting out the pairs you don't want to use.
In [3]:
# Only the last uncommented pair takes effect; uncomment the pair you want.
# disease = 'cholera'
# filename = 'diseases_20150616_1139_9.yml'
# disease = 'measles'
# filename = 'diseases_20150616_1139_9.yml'
disease = 'whooping'
filename = 'diseases_20150616_1139_43.yml'
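As an alternative to commenting lines in and out, the term-to-file pairs could be kept in a dictionary and selected by name. This is only a sketch: `RESULT_FILES` and `select_disease` are hypothetical names that do not appear elsewhere in this notebook, and the filenames are copied as they appear in the cells above (including the measles entry, which points at the same file as cholera).

```python
# Hypothetical lookup table: search term -> result file, copied from the cells above.
RESULT_FILES = {
    'consumption': 'diseases_20150616_1139_38.yml',
    'cholera': 'diseases_20150616_1139_9.yml',
    'measles': 'diseases_20150616_1139_9.yml',  # as listed above
    'whooping': 'diseases_20150616_1139_43.yml',
}


def select_disease(name, files=RESULT_FILES):
    """Return (search term, result filename), failing loudly on an unknown term."""
    if name not in files:
        raise KeyError("No result file listed for %r" % name)
    return name, files[name]


disease, filename = select_disease('whooping')
```

Changing the single argument to `select_disease` then switches both variables together, so the term and the file cannot drift apart.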
In [4]:
with open("data/" + filename, 'r') as f:
    # safe_load parses plain YAML data; recent PyYAML versions reject yaml.load without a Loader
    results = yaml.safe_load(f)
Take a look at the data for the disease type we're interested in
In [6]:
results[disease]
Out[6]:
Plot this out by year
In [7]:
# Convert the dict views to lists so matplotlib receives plain sequences
plt.plot(list(results[disease].keys()), list(results[disease].values()), 'x')
Out[7]:
In [19]:
from bokeh.plotting import figure, output_file, show
In [102]:
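# Name the output after the selected disease; the _counts suffix keeps it distinct from the normalised plot
output_file(disease + "_counts.html", title="Number of books referencing " + disease + " by year")
p = figure(title=disease + " references", x_axis_label='Year', y_axis_label='Number of books')
# Sort by year so the line is drawn in chronological order
years = sorted(results[disease])
p.line(years, [results[disease][year] for year in years])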
show(p)
In [8]:
normal_filename = 'normaliser_20150616_1844.yml'
with open('data/' + normal_filename, 'r') as f:
    # Per-year publication figures used as the normalising denominator
    publication = yaml.safe_load(f)
These steps carry out the normalisation: dividing the per-year word occurrences by the per-year book count gives a words-per-book-per-year measure.
With the new data we can instead normalise this as a words/words ratio.
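The normalisation described above can be tried out on made-up numbers before running it on the real files. This is a minimal sketch: `normalise_counts`, `toy_counts` and `toy_totals` are hypothetical names and values, not data from the corpus, and the totals here are plain numbers, whereas the notebook indexes into each year's entry with `[0]`.

```python
def normalise_counts(counts, totals):
    """Divide each year's count by that year's total.

    Years that are not positive, or that have no usable total, are skipped.
    """
    normed = {}
    for year, count in counts.items():
        total = totals.get(year)
        if year > 0 and total:
            normed[year] = count / float(total)
    return normed


# Invented figures, for illustration only.
toy_counts = {1800: 5, 1801: 12, 0: 3}   # year 0 stands in for an undated entry
toy_totals = {1800: 50, 1801: 60}
toy_normed = normalise_counts(toy_counts, toy_totals)
```

With these toy inputs, 1800 and 1801 are kept and the year-0 entry is dropped, mirroring the `year>0` check in the cell that follows.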
In [15]:
normed_results = {}
for year in results[disease]:
    # Skip entries whose year is not positive
    if year > 0:
        normed_results[year] = results[disease][year] / float(publication[year][0])
In [17]:
# Convert the dict views to lists so matplotlib receives plain sequences
plt.plot(list(normed_results.keys()), list(normed_results.values()), 'x')
Out[17]:
In [22]:
output_file(disease + ".html", title="Proportion of books referencing " + disease + " by year")
# The y axis now shows the normalised ratio rather than a raw count
p = figure(title=disease + " references", x_axis_label='Year', y_axis_label='Proportion of books')
p.scatter(list(normed_results.keys()), list(normed_results.values()))
show(p)
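For the comparison step named at the start of this notebook, one possible first move is to line up several normalised series on a shared set of years. The sketch below assumes each disease has already been reduced to a `{year: value}` dictionary like `normed_results`; `align_by_year` and the toy values are hypothetical and are not results from the corpus.

```python
def align_by_year(series_by_disease):
    """Return (years, columns) for several {year: value} series.

    years is the sorted union of all years; columns maps each disease to a
    list of values in that order, with None where a year is missing.
    """
    years = sorted({year for series in series_by_disease.values() for year in series})
    columns = {
        name: [series.get(year) for year in years]
        for name, series in series_by_disease.items()
    }
    return years, columns


# Invented values, for illustration only.
toy_series = {
    'cholera': {1830: 0.02, 1832: 0.05},
    'measles': {1830: 0.01, 1831: 0.03},
}
years, columns = align_by_year(toy_series)
```

Each list in `columns` has the same length as `years`, so the series can be plotted against a common x axis once the missing years are filtered out or otherwise handled.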