Figure sizes

Here we're looking at how figure sizes vary over time. The first stage is to gather the data and create an initial plot.


In [1]:
import yaml
import matplotlib.pyplot as plt

#filename = 'find_figures_20150616_1110_6.yml'
#filename = 'datain/find_figures_20150616_1110.yml'
#filename = 'datain/test_fig_format.yml'
filename = 'datain/find_all_figures/find_all_figures_20150621_0855.yml'

This file is fairly big, so it typically takes a while to load:


In [3]:
with open(filename, 'r') as f:
    results = yaml.safe_load(f)  # safe_load avoids executing arbitrary YAML tags
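
Parsing can be sped up considerably if PyYAML was built against libyaml: the C-backed loader is a drop-in replacement. A sketch (it falls back to the pure-Python loader when the C extension isn't available):

```python
import yaml

# Prefer the libyaml-backed loader when available; it is typically
# much faster than the pure-Python one.
try:
    from yaml import CSafeLoader as FastLoader
except ImportError:
    from yaml import SafeLoader as FastLoader

def load_yaml(path):
    """Parse a YAML file with the fastest safe loader available."""
    with open(path, 'r') as f:
        return yaml.load(f, Loader=FastLoader)
```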

Here, we build a dict and two lists to capture dates and the associated figure sizes (as a proportion of the total page area):


In [22]:
dates = []
percentages = []
tobin = {}  # figure sizes binned by year
for year, value in results.items():
    if 1749 < year < 1851:
        for entry in value:
            for picture in entry[4]:  # entry[4] holds the figures for this record
                if 0 < picture[5] < 101:  # picture[5] is the figure size as a % of the page
                    dates.append(year)
                    percentages.append(picture[5])
                    if year in tobin:
                        tobin[year].append(picture[5])
                    else:
                        tobin[year] = [picture[5]]
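
As an aside, the `year in tobin` membership test can be avoided with `collections.defaultdict`, which creates the empty list on first access. A minimal sketch of the same binning pattern, with toy data standing in for the corpus:

```python
from collections import defaultdict

tobin = defaultdict(list)  # missing years start as an empty list

# toy (year, size) pairs standing in for the real corpus data
for year, size in [(1750, 12.5), (1750, 40.0), (1800, 7.2)]:
    tobin[year].append(size)  # no membership test needed
```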

I'll save this smaller subset of the data as a CSV, which will be easier to load later if needed (not shown).


In [23]:
import csv

with open('outputs/percentages_dict.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, value in tobin.items():
        writer.writerow([key, value])
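
Since `writerow()` stores each list of sizes as its Python repr, the dict can be rebuilt later with `ast.literal_eval`. A sketch of one way to do the reload (`read_percentages_csv` is a hypothetical helper, not part of the notebook):

```python
import ast
import csv

def read_percentages_csv(path):
    """Rebuild the {year: [sizes]} dict from the saved CSV."""
    binned = {}
    with open(path, newline='') as f:
        for key, value in csv.reader(f):
            # the second cell is a stringified list, e.g. "[12.5, 40.0]"
            binned[int(key)] = ast.literal_eval(value)
    return binned
```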

First, we plot each point, using a low alpha (opacity) value to create a de facto heatmap. Note that at this stage we haven't normalised the activity in each year with respect to the total number of books in the corpus from that year; doing so is quite a lot more sophisticated from a plotting perspective.


In [42]:
plt.figure(figsize=(10, 10))
plt.plot(dates, percentages, 'o', alpha=0.01)
plt.xlabel('Year')
plt.ylabel('Size of figures [% of page]')
plt.title('Instances of figures by year and size; large corpus')
plt.ylim(0, 101)
plt.xlim(1750, 1850)
#plt.savefig('outputs/pointmap_small.png')
plt.savefig('outputs/pointmap_large.png')


This 2D histogram plot is a bit more like an actual heatmap:


In [41]:
import numpy as np

# Bin the points into a 25x25 2D histogram: years along the first axis,
# figure sizes along the second.
heatdata, xedges, yedges = np.histogram2d(dates, percentages, bins=(25, 25))
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.figure(figsize=(10, 10))
# pcolor expects C with shape (len(yedges) - 1, len(xedges) - 1), hence the transpose
plt.pcolor(xedges, yedges, heatdata.T, cmap=plt.cm.Blues)
plt.xlabel('Year')
plt.ylabel('Size of figures [% of page]')
plt.savefig('sizes_pcolor.jpg')  # save before show(), or the saved file is blank
plt.show()

Now we create a real heatmap from the 2D histogram:


In [44]:
plt.figure(figsize=(10, 10))
# Transpose so years run along x and sizes along y; origin='lower' puts
# small sizes at the bottom, so no axis inversion is needed.
plt.imshow(heatdata.T, extent=extent, origin='lower', aspect='auto',
           cmap=plt.cm.Blues)
plt.xlabel('Year')
plt.ylabel('Size of figures [% of page]')
plt.title('Heatmap of figure sizes over time')
plt.xlim(1750, 1850)
plt.ylim(0, 101)
plt.savefig('outputs/large_heatmap.jpg')
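
As a sketch of the normalisation mentioned earlier: one option is to divide each year's counts by that year's total (or, given the right data, by the number of books published that year), so that busy years don't dominate the heatmap. This is an assumption about how one might do it, not the notebook's actual method; `normalise_by_year` is a hypothetical helper operating on a (years x sizes) histogram:

```python
import numpy as np

def normalise_by_year(heat):
    """Scale each year's row of a (years x sizes) 2D histogram
    so it sums to 1, leaving empty years as zeros."""
    heat = np.asarray(heat, dtype=float)
    totals = heat.sum(axis=1, keepdims=True)  # one total per year bin
    # divide row-wise, writing 0 where a year has no figures at all
    return np.divide(heat, totals, out=np.zeros_like(heat), where=totals > 0)
```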


Finally, Bokeh allows us to create an interactive, zoomable version. Bear in mind that this may run a little slowly in your browser, as it's plotting each point individually, as with the first example above.


In [29]:
from bokeh.plotting import figure, output_file, show

In [40]:
output_file("outputs/figures_points.html", title="All points")
p = figure(title="All figures in the corpus sorted by size and year",
           x_axis_label='Year', y_axis_label='%')
p.scatter(dates, percentages, fill_color='blue', line_color=None, alpha=0.01)
#p = HeatMap(heatdata, xlabel='Year', ylabel='%')
show(p)
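
If the interactive plot feels sluggish, plotting a random subsample keeps the overall shape of the distribution while cutting the number of glyphs the browser has to draw. A sketch, not part of the original analysis (`subsample` is a hypothetical helper):

```python
import random

def subsample(dates, percentages, k=20000, seed=0):
    """Return up to k (date, percentage) pairs drawn without replacement."""
    pairs = list(zip(dates, percentages))
    if len(pairs) <= k:
        return dates, percentages
    rng = random.Random(seed)  # fixed seed for a reproducible plot
    sample = rng.sample(pairs, k)
    sub_dates, sub_percentages = zip(*sample)
    return list(sub_dates), list(sub_percentages)
```

The subsampled lists can then be passed to `p.scatter` in place of the full `dates` and `percentages`.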