The larger query has examined the presence and rate of occurrence of certain disease names in the corpus. Here we want to 1) look at this data, 2) graph it per year, and eventually 3) compare these diseases. We will also need to 4) normalise the data, to account for increased publishing activity over time.
In [3]:
import yaml
import matplotlib.pyplot as plt

# A single search term and its pre-computed results file to start with
disease = 'consumption'
filename = 'diseases_20150616_1139_38.yml'
In [4]:
# Alternative term/filename pairs tried one at a time; each assignment overrides the previous
disease = 'cholera'
filename = 'diseases_20150616_1139_9.yml'
disease = 'measles'
filename = 'diseases_20150616_1139_9.yml'
disease = 'whooping'
filename = 'diseases_20150616_1139_43.yml'
Set up a larger structure to accommodate and compare multiple terms.
First, create a dictionary of term/filename pairs.
In [5]:
diseases = {'consumption': 'diseases_20150616_1139_38.yml',
            'cholera': 'diseases_20150616_1139_9.yml',
            'measles': 'diseases_20150616_1139_9.yml',
            'whooping': 'diseases_20150616_1139_43.yml'}
In [6]:
results = {}
for search_term in diseases:
    # Each YAML file maps the search term to its per-year occurrence counts
    with open("data/" + diseases[search_term], 'r') as f:
        results[search_term] = yaml.load(f)[search_term]
Take a look at the data for the disease type we're interested in
In [7]:
#results
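The full dictionary is large, so a few sample entries are enough to see the structure. A minimal peek, assuming each result maps years to raw occurrence counts:
In [ ]:
for term, counts in results.items():
    sample_years = sorted(counts)[:3]
    print(term, {year: counts[year] for year in sample_years})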
Plot this out by year
In [11]:
for disease, resultors in results.items():
    plt.plot(resultors.keys(), resultors.values(), '-')
plt.xlim(1800, 1900)
Out[11]:
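Dictionary keys are not guaranteed to come out in year order, so the line plot can double back on itself. A minimal variant that sorts the years first (same data, purely an ordering fix):
In [ ]:
for disease, resultors in results.items():
    years = sorted(resultors)
    plt.plot(years, [resultors[y] for y in years], '-', label=disease)
plt.xlim(1800, 1900)
plt.legend(loc='best')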
Here, we examine the total number of books published over the period to see how much our search terms are affected by the increase in the number of books (and pages, and words!) published over the measurement period.
This is an older estimate and needs to be updated based on data drawn directly from the corpus.
These stages carry out the normalisation: dividing the per-year word occurrence count by the per-year book count gives a words-per-book-per-year measure.
With the new data, we can instead normalise as a ratio of term occurrences to total words.
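As a worked example, with made-up numbers: if a term occurs 120 times in a year in which the corpus contains 4,000,000 words, it is recorded as 1,000,000 * 120 / 4,000,000 = 30 occurrences per million words.
In [ ]:
# Hypothetical numbers for illustration only
term_count = 120         # occurrences of the term in one year
total_words = 4000000    # total words published in the corpus that year
per_million = 1000000 * float(term_count) / total_words   # 30.0 occurrences per million words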
In [12]:
normal_filename = 'normaliser_20150616_1844.yml'
with open('data/' + normal_filename, 'r') as f:
    publication = yaml.load(f)
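Before normalising, it helps to check the shape of the normaliser data. A minimal inspection, assuming each year maps to a list of per-year totals indexed as described in the next cell (this structure is inferred from how norm_field is used below):
In [ ]:
# Assumed structure: {year: [books, pages, words]}
for year in sorted(publication)[:3]:
    print(year, publication[year])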
Do a normalisation. The norm_field index selects the unit: 0 = by book, 1 = by page, 2 = by word.
In [34]:
normed = {}
norm_field = 2   # normalise by word count (0 = books, 1 = pages, 2 = words)
for disease, resultors in results.items():
    normed[disease] = {}
    for year in publication:
        if year in resultors:
            # Occurrences per million words published that year
            normed[disease][year] = 1000000 * float(resultors[year]) / publication[year][norm_field]
    #for year, value in normed[disease].items():
    #    normed[disease][year] = value / max(normed[disease].values())
In [39]:
for disease, resultors in normed.items():
    plt.plot(resultors.keys(), resultors.values())
plt.xlim(1750, 1900)
plt.legend(normed.keys(), loc=1)
plt.xlabel('Year')
plt.ylabel('Number of instances per million words')
plt.ylim(0, 35)
plt.savefig("diseases.jpg")
This was an attempt to smooth, but it would require pre-interpolation to create even intervals.
In [21]:
from numpy import convolve
In [59]:
smoothed = {}
smoothor = [1, 1, 1, 1, 1]   # intended five-year boxcar window for a moving average
for disease, resultors in normed.items():
    smoothed[disease] = {}
    for year in publication:
        if year in resultors:
            smoothed[disease] = 1   # placeholder: the smoothing itself was never completed
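A minimal sketch of how the smoothing could be completed: interpolate each normalised series onto a regular yearly grid (so the intervals are even), then take a five-year moving average with numpy's convolve. The interpolation step and the window length here are assumptions, not part of the original attempt.
In [ ]:
import numpy as np

window = 5
smoothed = {}
for disease, resultors in normed.items():
    years = sorted(resultors)
    # Interpolate onto every year in the range so the moving average has even intervals
    grid = np.arange(min(years), max(years) + 1)
    values = np.interp(grid, years, [resultors[y] for y in years])
    # Five-year boxcar moving average; mode='same' keeps output aligned with the grid
    averaged = np.convolve(values, np.ones(window) / float(window), mode='same')
    smoothed[disease] = dict(zip(grid, averaged))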
In [17]:
from bokeh.plotting import figure, output_file, show
In [40]:
output_file("All diseases.html", title="Number of references per million words by year")
p = figure(title= "Disease references", x_axis_label='Year', y_axis_label='Number of references per million words')
colors = ['red', 'green', 'blue', 'black']
countor = 0
for dis, norm in normed.items():
p.line(norm.keys(), norm.values(), line_color = colors[countor], legend = dis)
p.scatter(norm.keys(), norm.values(), line_color = colors[countor], fill_color = colors[countor], legend = dis)
countor +=1
#legend(colors)
show(p)
In [30]:
output_file("All diseases_scatter.html", title="Number of references per 1000 words by year")
p = figure(title= "Disease references", x_axis_label='Year', y_axis_label='Number of references per 1000 words')
colors = ['red', 'green', 'blue', 'black']
countor = 0
for dis, norm in normed.items():
p.scatter(norm.keys(), norm.values(), line_color = colors[countor], legend = dis)
countor +=1
#legend(colors)
show(p)