Prevalence of Disease

The larger query examined the presence and rate of occurrence of certain disease names in the corpus. Here we want 1) to look at this data, 2) graph it per year, and eventually 3) compare these diseases. We will also need to 4) normalise the data, to account for increased publishing activity over time.

Import relevant libraries and data


In [3]:
import yaml
import matplotlib.pyplot as plt
disease = 'consumption'
filename = 'diseases_20150616_1139_38.yml'

In [4]:
# Each reassignment below overrides the previous pair; only the last pair takes effect
disease = 'cholera'
filename = 'diseases_20150616_1139_9.yml'

disease = 'measles'
filename = 'diseases_20150616_1139_9.yml'

disease = 'whooping'
filename = 'diseases_20150616_1139_43.yml'

Set up a larger structure to accommodate and compare multiple terms;

First create multiple term/filename pairs


In [5]:
diseases = {'consumption': 'diseases_20150616_1139_38.yml',
            'cholera': 'diseases_20150616_1139_9.yml',
            'measles': 'diseases_20150616_1139_9.yml',
            'whooping': 'diseases_20150616_1139_43.yml'}

In [6]:
results = {}
for search_term in diseases:
    with open("data/" + diseases[search_term], 'r') as f:
        results[search_term] = yaml.safe_load(f)[search_term]

Take a look at the data for the disease type we're interested in


In [7]:
#results
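Dumping the whole dict is noisy, so a quick peek at a few (year, count) pairs is usually enough. The structure below is assumed from the loading cell above — each search term maps to a dict of year → raw count — and the sample values are invented:

```python
# Hypothetical shape of `results`: {search_term: {year: raw_count}}
results = {'cholera': {1831: 12, 1832: 48, 1849: 95, 1850: 60}}

# Show the first few years in order
for year in sorted(results['cholera'])[:3]:
    print(year, results['cholera'][year])
```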

Plot this out by year


In [11]:
for disease, resultors in results.items():
    # Sort the years so the line is drawn in chronological order
    years = sorted(resultors)
    plt.plot(years, [resultors[year] for year in years], '-')
plt.xlim(1800, 1900)


Out[11]:
(1800, 1900)

Normalisation

Here, we examine the total number of books published over the period, to see how much our search terms are affected by the growth in the number of books (and pages, and words!) published over the measurement period.

This is an older estimate and needs to be updated with data drawn directly from the corpus.

These stages carry out the normalisation: dividing the per-year word occurrence count by the per-year book count to get a words-per-book-per-year measure.

With the new data we can normalise this as a ratio of word occurrences to total words.
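The arithmetic can be sketched with toy numbers. All values below are invented, and the per-year publication entry is assumed to be a (books, pages, words) tuple, matching the `norm_field` index used in the cells below:

```python
# Hypothetical raw hits per year and corpus sizes per year
raw_counts = {1850: 120, 1851: 150}
publication = {1850: (400, 120000, 30000000),   # (books, pages, words)
               1851: (500, 160000, 40000000)}

norm_field = 2  # normalise by total words
per_million = {year: 1000000 * float(raw_counts[year]) / publication[year][norm_field]
               for year in raw_counts if year in publication}
print(per_million)  # {1850: 4.0, 1851: 3.75}
```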


In [12]:
normal_filename = 'normaliser_20150616_1844.yml'
with open('data/' + normal_filename, 'r') as f:
    publication = yaml.safe_load(f)

Do a normalisation; norm_field selects the denominator: 0 = by book, 1 = by page, 2 = by word


In [34]:
normed = {}
norm_field = 2  # 0 = books, 1 = pages, 2 = words
for disease, resultors in results.items():
    normed[disease] = {}
    for year in publication:
        if year in resultors:
            # Scale to occurrences per million (words, here)
            normed[disease][year] = 1000000 * float(resultors[year]) / publication[year][norm_field]
    #for year, value in normed[disease].items():
        #normed[disease][year] = value/max(normed[disease].values())

In [39]:
for disease, resultors in normed.items():
    years = sorted(resultors)
    plt.plot(years, [resultors[year] for year in years])
plt.xlim(1750, 1900)
plt.legend(normed.keys(), loc=1)
plt.xlabel('Year')
plt.ylabel('Number of instances per million words')
plt.ylim(0, 35)
plt.savefig("diseases.jpg")


This was an attempt to smooth, but would require pre-interpolation to create even intervals


In [21]:
from numpy import convolve

In [59]:
smoothed = {}
kernel = [0.2, 0.2, 0.2, 0.2, 0.2]  # 5-point moving average
for disease, resultors in normed.items():
    years = sorted(resultors)
    values = [resultors[year] for year in years]
    # Note: convolve assumes evenly spaced years; gaps would need interpolation first
    smoothed[disease] = dict(zip(years[2:-2], convolve(values, kernel, mode='valid')))
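The pre-interpolation mentioned above can be sketched with numpy: resample onto every year in the range with `numpy.interp`, then apply the moving average over the now evenly spaced series. The data values here are invented:

```python
import numpy as np

# Hypothetical unevenly spaced series
years = np.array([1800, 1803, 1804, 1810, 1811])
values = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Resample onto every year in the range (linear interpolation), then smooth
even_years = np.arange(years.min(), years.max() + 1)
even_values = np.interp(even_years, years, values)
kernel = np.ones(5) / 5.0                      # 5-point moving average
smooth = np.convolve(even_values, kernel, mode='valid')
print(len(even_years), len(smooth))  # 12 8
```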

Bokeh

Bokeh is a library which lets us create interactive web-based graphs and charts.

Here, we're importing the library, creating a graph with appropriate axis labels, and displaying it in a web browser. We could put this on a web server or embed it in an HTML page (one which allows JavaScript).


In [17]:
from bokeh.plotting import figure, output_file, show


BokehJS successfully loaded.

In [40]:
output_file("All diseases.html", title="Number of references per million words by year")
p = figure(title="Disease references", x_axis_label='Year', y_axis_label='Number of references per million words')
colors = ['red', 'green', 'blue', 'black']
countor = 0
for dis, norm in normed.items():
    years = sorted(norm)
    values = [norm[year] for year in years]
    p.line(years, values, line_color=colors[countor], legend=dis)
    p.scatter(years, values, line_color=colors[countor], fill_color=colors[countor], legend=dis)
    countor += 1
show(p)


Session output file 'All diseases.html' already exists, will be overwritten.

In [30]:
output_file("All diseases_scatter.html", title="Number of references per million words by year")
p = figure(title="Disease references", x_axis_label='Year', y_axis_label='Number of references per million words')
colors = ['red', 'green', 'blue', 'black']
countor = 0
for dis, norm in normed.items():
    years = sorted(norm)
    p.scatter(years, [norm[year] for year in years], line_color=colors[countor], legend=dis)
    countor += 1
show(p)


Session output file 'All diseases_scatter.html' already exists, will be overwritten.
