This notebook was downloaded from https://anaconda.org/jbednar/census/notebook. It was presented by Dr. James Bednar of Continuum IO in a webinar on February 2, 2016; see the original posting for a link to the slides.
See the file datashader_LICENSE.txt for the license (allowing redistribution under certain conditions).
Modifications have been made to the data input (HDF5 instead of castra). The HDF5 file (1.9 GB) is available here.
The 2010 Census collected a variety of demographic information for the more than 300 million people in the USA. Here we'll focus on the subset of the data selected by the Cooper Center, who produced a map of the population density and racial makeup of the USA (http://www.coopercenter.org/demographics/Racial-Dot-Map). Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the block level, so that the racial category can only be determined to within a rough geographic precision.) The Cooper Center website delivers pre-rendered tiles, which is fast but limited to the plotting choices they made; here we will show how to run novel analyses focusing on whatever aspects of the data you select yourself, rendered dynamically as requested using the datashader library.
First, let's load this data using dask. Dask lets you work with data in a variety of formats, and supports both in-core and out-of-core operation. In-core operation is faster, so if you have 16GB of RAM or more you can direct dask to load the entire dataset into memory by uncommenting the df.cache(cache=dict) line below; with less RAM, leave it commented out to rely on the out-of-core support. The data were originally converted into castra format, which supports quick access, but this modified notebook reads an HDF5 version instead (CSV could also have been used, with some performance penalty).
In [ ]:
import datashader as ds
import datashader.transfer_functions as tf
import dask.dataframe as dd
import numpy as np
Note: this issue details that the castra file cannot be read in Python 3.5 because it was written in Python 2.7. A workaround is to read it in Python 2.7 and write it out to HDF5, along the lines of the sketch below.
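Here is a rough sketch of that one-off conversion, to be run under Python 2.7 in a separate session (the file paths and HDF5 key are assumed to match this notebook; dd.from_castra existed only in older dask versions):

# Run under Python 2.7, which can still read the castra file
import dask.dataframe as dd
df = dd.from_castra('data/census.castra')
df.to_hdf('data/census.hdf', 'census')   # writes the HDF5 copy read below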
In [ ]:
%%time
#df = dd.from_castra('data/census.castra')  # original castra-based loading (Python 2.7 only)
df = dd.read_hdf('data/census.hdf', key='census')
#df = df.cache(cache=dict)  # uncomment to load the dataset fully into memory (needs ~16GB RAM)
In [ ]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning, message='.*use @default decorator instead.*')
df.tail()
The output of .tail() shows that there are more than 300 million datapoints (one per person), each with a location in Web Mercator format, and that the race for each datapoint has been encoded as a single character (where 'w' is white, 'b' is black, 'a' is Asian, 'h' is Hispanic, and 'o' is other (typically Native American)).
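To check this breakdown for yourself, you can count the datapoints per category with ordinary dask operations (a quick sanity check, assuming the column names shown by .tail() above; the .compute() call makes a full pass over the data, so it takes a while):
In [ ]:
# Number of datapoints per racial category; compute() triggers the actual work
print(df.race.value_counts().compute())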
Let's define some geographic ranges to look at later, and also a default plot size. Feel free to increase plot_width to 2000 or more if you have a very large monitor or want to save files to disk, which shouldn't greatly affect the processing time or memory requirements.
In [ ]:
USA = ((-13884029, -7453304), (2698291, 6455972))
LakeMichigan = ((-10206131, -9348029), (4975642, 5477059))
Chicago = (( -9828281, -9717659), (5096658, 5161298))
Chinatown = (( -9759210, -9754583), (5137122, 5139825))
NewYorkCity = (( -8280656, -8175066), (4940514, 4998954))
LosAngeles = ((-13195052, -13114944), (3979242, 4023720))
Houston = ((-10692703, -10539441), (3432521, 3517616))
Austin = ((-10898752, -10855820), (3525750, 3550837))
NewOrleans = ((-10059963, -10006348), (3480787, 3510555))
Atlanta = (( -9448349, -9354773), (3955797, 4007753))
x_range,y_range = USA
plot_width = int(1000)
plot_height = int(plot_width*7.0/12)
Let's also choose a background color for our results. A black background makes bright colors more vivid, and works well when later adding relatively dark satellite image backgrounds, but white backgrounds are good for examining the weakest patterns, and work well when adding vector-based maps with light colors. Try it both ways and decide for yourself!
In [ ]:
black_background = True
from IPython.core.display import HTML, display
display(HTML("<style>.container { width:100% !important; }</style>"))
We'll also need some utility functions:
In [ ]:
def export(img, filename, fmt=".png", _return=True):
    """Given a datashader Image object, saves it to a disk file in the requested format"""
    if black_background:  # Optional; removes transparency to force background for exported images
        img = tf.set_background(img, "black")
    img.to_pil().save(filename+fmt)
    return img if _return else None
def cm(base_colormap, start=0, end=1.0, reverse=not black_background):
    """
    Given a colormap in the form of a list, such as a Bokeh palette,
    return a version of the colormap reversed if requested, and selecting
    a subset (on a scale 0,1.0) of the elements in the colormap list.

    For instance:

    >>> cmap = ["#000000", "#969696", "#d9d9d9", "#ffffff"]
    >>> cm(cmap,reverse=True)
    ['#ffffff', '#d9d9d9', '#969696', '#000000']
    >>> cm(cmap,0.3,reverse=True)
    ['#d9d9d9', '#969696', '#000000']
    """
    full = list(reversed(base_colormap)) if reverse else base_colormap
    num = len(full)
    return full[int(start*num):int(end*num)]
from datashader.colors import Greys9, Hot, viridis, inferno
In [ ]:
%%time
cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(df, 'meterswest', 'metersnorth')
Computing this aggregate grid will take some CPU power (5 seconds on a MacBook Pro), because datashader has to iterate through the entire dataset of hundreds of millions of points. Once the agg array has been computed, subsequent processing will be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.
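The result is an ordinary xarray DataArray of counts, one per pixel, which you can inspect directly (a quick check on the agg object from the cell above):
In [ ]:
print(agg.shape)       # (plot_height, plot_width) grid of per-pixel counts
print(int(agg.sum()))  # total number of datapoints aggregated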
If we now plot the aggregate grid linearly, we can clearly see...
In [ ]:
export(tf.interpolate(agg, cmap = cm(Greys9), how='linear'),"census_gray_linear")
...almost nothing. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago, and a few other places. More hotspots can dimly be seen when using a white background than with a black, on most monitors, though they are very dim either way. In any case, for feeding 300 million points in, we're getting almost nothing back in terms of visualization.
The first thing we can do is to prevent undersampling (see plotting_pitfalls.ipynb). In the plot above, there is no way to distinguish between pixels that are part of the background, and those that have low but nonzero counts; both are mapped to black or nearly black on a linear scale. Instead, let's map all values that are not background to a dimly visible gray, leaving the highest-density values at white. I.e., let's discard the first 25% of the gray colormap, and linearly interpolate the population densities over the remaining range:
In [ ]:
export(tf.interpolate(agg, cmap = cm(Greys9,0.25), how='linear'),"census_gray_linear2")
The above plot reveals at least that data has been measured only within the political boundaries of the continental United States, and also that many areas in the West are so poorly populated that many pixels contained not even a single person. (In datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the colormap range is shown for pixels that do have data.) Some additional population centers are now visible, at least on some monitors. But mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions, and nearly all other pixels having much, much lower (but nonzero) values. Again, that's not much information to be getting out of 300 million datapoints!
The problem is that of the available intensity scale in this gray colormap, nearly all pixels are colored the same low-end gray value, with only a few urban areas using any other colors. Thus this map also conveys very little information. Because the data are clearly distributed so non-uniformly, let's instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice:
In [ ]:
export(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='log'),"census_gray_log")
Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England, and the many towns stretched out along roadsides in the Midwest (especially those leading to Denver, the hotspot just east of the Rocky Mountains).
Clearly, we can now see much more of what's going on in this dataset, thanks to the logarithmic mapping. Yet the choice of 'log' was purely arbitrary, and one could easily imagine that other nonlinear functions would show other interesting patterns. Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities -- small towns show up clearly even if they are just slightly more dense than their immediate, rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.
With this idea of showing relative differences across a large range of data values in mind, let's try the image-processing technique called histogram equalization. I.e., given a set of raw counts, map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric -- it will equalize any linearly or nonlinearly distributed integer data, regardless of the distribution:
In [ ]:
export(tf.interpolate(agg, cmap = cm(Greys9,0.2), how='eq_hist'),"census_gray_eq_hist")
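Conceptually, eq_hist maps each count to its percentile rank in the distribution of counts. Here is a minimal numpy sketch of that idea (an illustration only, not datashader's actual implementation):
In [ ]:
def eq_hist_sketch(counts, nbins=256):
    """Map integer counts to their approximate percentile rank on (0, 1]."""
    flat = counts[counts > 0]                    # ignore empty background pixels
    hist, edges = np.histogram(flat, bins=nbins)
    cdf = hist.cumsum() / float(hist.sum())      # cumulative distribution over the bins
    return np.interp(counts, edges[:-1], cdf)    # each count -> its percentile

# e.g. eq_hist_sketch(agg.data) spreads values roughly uniformly over (0, 1]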
(Histogram equalization also works for non-integer data, but in that case it will use a finite set of bins to divide the interval between the minimum and maximum values, and will thus not be able to normalize the histogram perfectly for highly non-uniform distributions.) Effectively, this transformation converts the data from raw magnitudes, which can easily span a much greater range than the dynamic range visible to the eye, to a rank-order or percentile representation, which reveals density differences at all ranges but obscures the absolute magnitudes involved. In this representation, you can clearly see the effects of geography (rivers, coastlines, and mountains) on the population density, as well as history (denser near the longest-populated areas), and even infrastructure (with many small towns located at crossroads).
Given the very different results from the different types of plot, a good practice when visualizing any dataset with datashader is to look at both the linear and the histogram-equalized versions of the data; the linear version preserves the magnitudes, but obscures the distribution, while the histogram-equalized version reveals the distribution while preserving only the order of the magnitudes, not their actual values. If both plots are similar, then the data is distributed nearly uniformly across the interval. But much more commonly, the distribution will be highly nonlinear, and the linear plot will reveal only the envelope of the data, i.e., the lowest and the highest values. In such cases, the histogram-equalized plot will reveal much more of the structure of the data, because it maps the local patterns in the data into perceptible color differences on the screen.
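You can check just how skewed this particular dataset is by looking at percentiles of the nonzero counts directly (a quick numpy check on the agg array from above):
In [ ]:
counts = agg.data.ravel()
counts = counts[counts > 0]                      # drop background pixels
print(np.percentile(counts, [50, 90, 99, 100]))  # median vs. extreme pixel counts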
Because we are only plotting a single dimension, we can use the colors of the display to effectively reach a higher dynamic range, mapping ranges of data values into different color ranges. Here we'll use a colormap whose colors are interpolated between the named colors printed below:
In [ ]:
print(cm(Hot,0.2))
export(tf.interpolate(agg, cmap = cm(Hot,0.2), how='eq_hist'),"census_ds_hot_eq_hist")
You can also import colormaps directly from matplotlib.cm or bokeh.palettes, though only Bokeh palettes will work with the cm() function that lets us switch backgrounds:
In [ ]:
from bokeh.palettes import PuRd9
export(tf.interpolate(agg, cmap=cm(PuRd9), how='eq_hist'),"census_purd_eq_hist")
So that they will work with cm, we provide Matplotlib's default viridis and inferno colormaps from within datashader:
In [ ]:
export(tf.interpolate(agg, cmap=cm(viridis), how='eq_hist'),"census_viridis_eq_hist")
The above colormap choices are largely a matter of personal preference, though some of them are more perceptually uniform (accurately conveying distance between data values for all colors) and some have higher dynamic ranges than others (allowing more precise differences between data values to be distinguished).
Colormaps can also be used to address very specific questions about the data itself. For instance, after histogram equalization, data should be uniformly distributed across the visible colormap. Thus if we want to highlight e.g. the top 1% of pixels, by population density, we can use a colormap divided into 100 ranges, and just change the top one to a different color:
In [ ]:
grays2 = cm([(i,i,i) for i in np.linspace(0,255,99)])
grays2 += ["red"]
export(tf.interpolate(agg, cmap = grays2, how='eq_hist'),"census_gray_redhot1_eq_hist")
The above plot now conveys nearly all the information available in the linear plot, i.e. that only a few pixels have the very highest population densities, while also conveying the structure of the data at all population density ranges via histogram equalization.
Since we've got the racial category for every pixel, we can use color to indicate the category value, instead of just extending dynamic range or highlighting percentiles as above. To do this, we first need to set up a color key for each category label, with different color keys as appropriate to make colors that stand out against the background:
In [ ]:
if black_background:
    color_key = {'w':'aqua', 'b':'lime',  'a':'red', 'h':'fuchsia', 'o':'yellow'}
else:
    color_key = {'w':'blue', 'b':'green', 'a':'red', 'h':'orange',  'o':'saddlebrown'}
We can now aggregate the counts per race into grids, using ds.count_cat, instead of just a single grid with the total counts (via the default aggregate reducer ds.count), and then generate an image by colorizing each pixel using the aggregate information from each category for that pixel's location:
In [ ]:
def create_image(x_range, y_range, w=plot_width, h=plot_height):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
    img = tf.colorize(agg, color_key, how='eq_hist')
    return img
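One caveat: datashader's ds.count_cat aggregation requires the counted column to be a pandas categorical dtype. If your copy of the HDF5 file stores 'race' as plain strings, a conversion along these lines may be needed first (a hedged sketch; support for categorical astype varies across dask versions):
In [ ]:
# Only needed if 'race' is not already a category dtype in the file
df['race'] = df['race'].astype('category')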
The result shows that the USA is overwhelmingly white, apart from some predominantly Hispanic regions along the Southern border, some regions with high densities of blacks in the Southeast, and a few isolated areas of category "Other" in the West (primarily Native American reservation areas).
In [ ]:
export(create_image(*USA),"Zoom 0 - USA")
Interestingly, the racial makeup has some sharp boundaries around urban centers, as we can see if we zoom in:
In [ ]:
export(create_image(*LakeMichigan),"Zoom 1 - Lake Michigan")
With sufficient zoom, it becomes clear that Chicago (like most large US cities) has both a wide diversity of racial groups, and profound geographic segregation:
In [ ]:
export(create_image(*Chicago),"Zoom 2 - Chicago")
Eventually, we can zoom in far enough to see individual datapoints. Here we can see that the Chinatown region of Chicago has, as expected, very high numbers of Asian residents, and that other nearby regions (separated by features like roads and highways) have other races, varying in how uniformly segregated they are:
In [ ]:
export(tf.spread(create_image(*Chinatown),px=int(plot_width/400)),"Zoom 3 - Chinatown")
Note that we've used the tf.spread function to enlarge each point to cover multiple pixels so that each point is clearly visible. Instead of the default circular spreading, you could choose shape='square' if you prefer, or any mask shape, e.g.:
In [ ]:
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 0, 0, 0, 1],
                 [1, 0, 0, 0, 1],
                 [1, 0, 0, 0, 1],
                 [1, 1, 1, 1, 1]])
export(tf.spread(create_image(*Chinatown), mask=mask),"Chinatown outlines")
In [ ]:
export(create_image(*NewYorkCity),"NYC")
In [ ]:
export(create_image(*LosAngeles),"LosAngeles")
In [ ]:
export(create_image(*Houston),"Houston")
In [ ]:
export(create_image(*Atlanta),"Atlanta")
In [ ]:
export(create_image(*NewOrleans),"NewOrleans")
In [ ]:
export(create_image(*Austin),"Austin")
In addition to simply visualizing categorical data, we can break it down and ask specific questions. For instance, if we switch back to the full USA and then select only the black population, we can see that it is mostly concentrated in urban areas, except in the South and along the East Coast:
In [ ]:
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height)
agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
export(tf.interpolate(agg.sel(race='b'), cmap=cm(Greys9,0.25), how='eq_hist'),"USA blacks")
(Compare to "census_gray_eq_hist" above.)
Or we can show only those pixels where there is at least one resident from each of the racial categories white, black, Asian, and Hispanic, which mainly highlights urban areas (compare to "Zoom 0 - USA" above):
In [ ]:
agg2 = agg.where((agg.sel(race=['w', 'b', 'a', 'h']) > 0).all(dim='race')).fillna(0)
export(tf.colorize(agg2, color_key, how='eq_hist'),"USA all")
In the above plot, the colors still show the racial makeup of each pixel, but the pixels have been filtered so that only those with at least one datapoint from every race are shown.
Or we can look at all pixels where there are more black than white datapoints, which highlights predominantly black neighborhoods of large urban areas across most of the USA, but also some rural areas and small towns in the South:
In [ ]:
export(tf.colorize(agg.where(agg.sel(race='w') < agg.sel(race='b')).fillna(0), color_key, how='eq_hist'),"more_blacks")
Here the colors still show the predominant race in that pixel, which is black for many of these, but in Southern California it looks like there are several large neighborhoods where blacks outnumber whites but both are outnumbered by Hispanics.
In any case, the thing to do here is to try out your own hypotheses and questions, whether for the USA or for your own region. The aggregate array is just an ordinary xarray multidimensional array, so you can see the xarray documentation for how to select and transform that data. E.g. you can try posing questions that are independent of the number of datapoints in each pixel, since that varies so much geographically, by normalizing the aggregated data in various ways. Now that the data's been aggregated but not yet rendered to the screen, there is an infinite range of queries you can pose!
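As one example, here is a hedged sketch of such a normalization: plotting the fraction of each pixel's population that is Hispanic, independent of the total density (the variable names and output filename are illustrative, not from the original notebook):
In [ ]:
total  = agg.sum(dim='race')                                     # per-pixel population
frac_h = (agg.sel(race='h') / total.where(total > 0)).fillna(0)  # Hispanic fraction
export(tf.interpolate(frac_h, cmap=cm(Greys9,0.25), how='linear'),"USA_hispanic_fraction")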
The above plots all show static images on their own. datashader is independent of Bokeh, but when combined with Bokeh, it is easy to make interactive plots that incorporate maps, satellite imagery, annotations, legends, and hover-tool information.
To do this, let's first create a basic Bokeh plot:
In [ ]:
import bokeh.plotting as bp
from bokeh.models.tiles import WMTSTileSource
bp.output_notebook()
def base_plot(tools='pan,wheel_zoom,reset', webgl=False):
    p = bp.figure(tools=tools,
                  plot_width=int(900*1.5), plot_height=int(500*1.5),
                  x_range=x_range, y_range=y_range, outline_line_color=None,
                  min_border=0, min_border_left=0, min_border_right=0,
                  min_border_top=0, min_border_bottom=0, webgl=webgl)
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    p.responsive = True
    return p
And then create an interactive plot using a callback to a datashader pipeline. In this pipeline, we'll use the tf.dynspread function to automatically increase the plotted size of each datapoint, once you've zoomed in so far that datapoints no longer have nearby neighbors. We'll also add some image tiles in the background, using satellite information by default (or you can uncomment the Stamen tiles line below to use vector-based map data instead):
In [ ]:
from datashader.callbacks import InteractiveImage
def image_callback(x_range, y_range, w, h):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'meterswest', 'metersnorth', ds.count_cat('race'))
    img = tf.colorize(agg, color_key, how='log')
    return tf.dynspread(img, threshold=0.75, max_px=8)
p = base_plot()
url="http://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.png"
#url="http://tile.stamen.com/toner-background/{Z}/{X}/{Y}.png"
tile_renderer = p.add_tile(WMTSTileSource(url=url))
tile_renderer.alpha=1.0 if black_background else 0.15
InteractiveImage(p, image_callback, throttle=1000)
Note that you need a live copy of the notebook, with a running server; zooming and panning will be disabled (or only work partially, which can be confusing!) in a static exported copy.
After uncommenting the last line below, you can similarly zoom into the population density data (ignoring race), or any other function of the aggregates:
In [ ]:
def image_callback2(x_range, y_range, w, h):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'meterswest', 'metersnorth')
    img = tf.interpolate(agg, cmap=list(reversed(Greys9)))
    return tf.dynspread(img, threshold=0.75, max_px=8)
p = base_plot()
#InteractiveImage(p, image_callback2, throttle=1000)
The dashboard.py example shows this same Census data in the context of an interactive dashboard, including color keys and hover information that help reveal the magnitudes at every location even while the plot faithfully reveals the structure.