Working with Geographical Data

Let's look at some real life data sets and see how we would analyze them using Julia. For this particular notebook, I will be working with data collected about US counties by various government agencies, but the general techniques we will see apply equally well to any other dataset. Here's the list of packages I'll be using (don't worry if you don't know what they do, I'll introduce them later):


In [22]:
using Shapefile
using DataArrays
using DataFrames
using Color
using SIUnits
using Requests

In [2]:
using Compose
using Gadfly


Warning: New definition 
    *(MeasureNil,Any) at /Users/kfischer/.julia/Compose/src/measure.jl:32
is ambiguous with: 
    *(Any,AbstractCalendarDuration) at /Users/kfischer/.julia/Calendar/src/Calendar.jl:341.
To fix, define 
    *(MeasureNil,AbstractCalendarDuration)
before the new definition.

In [3]:
using Taro
Taro.init()


WARNING: deprecated syntax "x[i:]" at /Users/kfischer/.julia/Compose/src/cairo_backends.jl:380.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /Users/kfischer/.julia/Compose/src/cairo_backends.jl:398.
Use "x[i:end]" instead.

WARNING: deprecated syntax "x[i:]" at /Users/kfischer/.julia/Compose/src/cairo_backends.jl:562.
Use "x[i:end]" instead.
Found libjvm @ /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/server

In [3]:
Compose.set_default_graphic_size(10inch,10inch/golden);
set_default_plot_size(10inch,10inch/golden);

Getting and importing the data

Before we can get started, we need to get ourselves some data. I will mostly be working with the datasets collected by the Census bureau here. But before we get any data from there, let start with something simple. I have preparred a CSV file with the names and states all the counties in the U.S. here. All the data from the Census will be coded by FIPS (Federal Information Processing Standard) code. This data files will help us map from FIPS code to county name.


In [4]:
get_file(URL,file) = isfile(file) || download(URL,file)


Out[4]:
get_file (generic function with 1 method)

In [5]:
get_file("https://gist.github.com/loladiro/8573181/raw/34894ec1a3cae030490b83be3a36b2166852e000/gistfile1.txt","USCounties.csv")
counties = readtable("USCounties.csv");
# Here's the first 10 entries so you can get a feel for what the data looks like
counties[1:10,:]


Out[5]:
NAMESTATE_NAMESTATE_FIPSCNTY_FIPSFIPS
1Lake of the WoodsMinnesota277727077
2FerryWashington531953019
3StevensWashington536553065
4OkanoganWashington534753047
5Pend OreilleWashington535153051
6BoundaryIdaho162116021
7LincolnMontana305330053
8FlatheadMontana302930029
9GlacierMontana303530035
10TooleMontana3010130101

In [5]:
#STATE_FIPS and CNTY_FIPS are redundant really, so let's get rid of them
delete!(counties,"STATE_FIPS")
delete!(counties,"CNTY_FIPS");

Next, let's get information about the land mass of each county.

Going back to the census website we find just the right data at http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls.

If you're wondering what the column names mean, there's a comprehensive overview at http://www2.census.gov/prod2/statcomp/usac/excel/Mastdata.xls


In [5]:
get_file("http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls","LND01.xls");

We will be using the Taro package to read the Excel file like so:


In [6]:
LND = Taro.readxl("/Users/kfischer/Downloads/LND01.xls","Sheet1","A1:AH3199");
LND[1:5,1:10] # Again, a small sneak preview


Out[6]:
AreanameSTCOULND010190FLND010190DLND010190N1LND010190N2LND010200FLND010200DLND010200N1LND010200N2
1UNITED STATES000000.03.78742508e6000000000.03.79408306e600000000
2ALABAMA010000.052422.94000000000.052419.0200000000
3Autauga, AL010010.0604.49000000000.0604.4500000000
4Baldwin, AL010030.02027.08000000000.02026.9300000000
5Barbour, AL010050.0904.59000000000.0904.5200000000

In [7]:
# Let's get rind of some of the data we don't need and rename the columns to be more human-friendly
# We're interested in the FIPS code, the total land area and the total water area
landdata = LND[["STCOU","LND010200D","LND210200D"]];
# Rename the columns
rename!(landdata,"STCOU","FIPS")
rename!(landdata,"LND010200D","land_area")
rename!(landdata,"LND210200D","water_area")
# Let's also get rid of counties for which there's no data:
landdata = landdata[find(x->x!=0,landdata["land_area"]),:]
landdata = landdata[find(x->x!=0,landdata["water_area"]),:]
# Let's also make FIPS an integer
landdata["FIPS"] = int(landdata["FIPS"]);
landdata[1:5,:]


Out[7]:
FIPSland_areawater_area
103.79408306e6256644.62
2100052419.021675.01
31001604.458.48
410032026.93430.58
51005904.5219.61

In [8]:
# Let's combine it with the CSV we originally loaded (by FIPS code)
data = join(counties, landdata; on="FIPS")
data[1:5,:]


Out[8]:
NAMESTATE_NAMEFIPSland_areawater_area
1AutaugaAlabama1001604.458.48
2BaldwinAlabama10032026.93430.58
3BarbourAlabama1005904.5219.61
4BibbAlabama1007626.163.14
5BlountAlabama1009650.65.02

In [9]:
set_default_plot_size(10inch,10inch/golden);
plot(data, x="land_area", y="water_area")
#, color="STATE_NAME", Scale.y_log10, Scale.x_log10


Out[9]:

In [10]:
plot(data, x="land_area", Geom.histogram, color="STATE_NAME", Scale.x_log10)


Out[10]:

Plotting on a map

Since we're dealing with geographical data, it would be great to plot it on a map. I'll use this as an example how one might build a custom visualization in Julia using the Compose package. So let's get started.

Since I want to plot US counties, we first need to know how to draw them. For that purpose, I googled "US counties shapefile" and downloaded the USCounties.zip file from here (the first hit on google). To follow along, download it and extract it into the current directory.

The .shp (shapefile) format is a very simple file format for defining shapes and is commonly used in mapping. I wrote a simple package, to read this file format: https://github.com/loladiro/Shapefile.jl. It's about 130 lines of code. There's no need to really understand it, but do have a look at the source if you're curious.


In [10]:
shp = open("USCounties.shp") do fd
    read(fd,Shapefile.Handle)
end;

At this point we have the shape file loaded. By loading the file in an external viewer determined that the shapes are in the same order as the original CSV, what a coincidence ;). This information is embedded in the shapefile, but my package doesn't read it out, yet, though this would be a fun projet to add !


In [10]:
# Since the order is the same we can just add it as an extra column
counties["Shape"] = shp.shapes;

In [11]:
data = join(counties, landdata; on="FIPS");
data[1:5,:]


Out[11]:
NAMESTATE_NAMEFIPSShapeland_areawater_area
1AutaugaAlabama1001Polygon(39 Float64 Points)604.458.48
2BaldwinAlabama1003Polygon(69 Float64 Points)2026.93430.58
3BarbourAlabama1005Polygon(37 Float64 Points)904.5219.61
4BibbAlabama1007Polygon(33 Float64 Points)626.163.14
5BlountAlabama1009Polygon(55 Float64 Points)650.65.02

In [12]:
# Going from one kind of polygon to another
ESRI2Compose(poly::Shapefile.Polygon) =
Compose.FormTree(Compose.Polygon([Compose.Point(point.x,point.y) for point in poly.points]))


Out[12]:
ESRI2Compose (generic function with 1 method)

In [13]:
# Drawing counties
function generateCountyOutline(data,color,dims)
    template = set_unit_box(canvas(),dims)
    c = canvas(template)
    for row in EachRow(data)
        c = compose(c,compose(canvas(template),ESRI2Compose(row["Shape"][1]),
            linewidth(0.05mm),stroke(color),fill(nothing)))
    end
    c
end
dims = UnitBox(shp.MBR.left,shp.MBR.bottom,(shp.MBR.right-shp.MBR.left),-(shp.MBR.bottom-shp.MBR.top))
generateCountyOutline(data,color) = generateCountyOutline(data,color,dims)


Out[13]:
generateCountyOutline (generic function with 2 methods)

In [14]:
generateCountyOutline(data,color("black"))


Out[14]:

In [14]:
# Ok that was a little small, let's get rid of Hawaii and Alaska
counties2 = copy(counties)
DataFrames.deleterows!(counties2,find(!((counties2["STATE_NAME"].=="Alaska")|(counties2["STATE_NAME"].=="Hawaii"))))
# We need to recalculate the bounding rectangle
minx = minimum(map(shape->mapreduce(p->p.x,min,shape.points),counties2["Shape"]))
maxx = maximum(map(shape->mapreduce(p->p.x,max,shape.points),counties2["Shape"]))
miny = minimum(map(shape->mapreduce(p->p.y,min,shape.points),counties2["Shape"]))
maxy = maximum(map(shape->mapreduce(p->p.y,max,shape.points),counties2["Shape"]))
dims2 = UnitBox(minx,maxy,(maxx-minx),-(maxy-miny));

In [15]:
data2 = join(counties2, landdata; on=:FIPS)
generateCountyOutline(data2,color("black"),dims2)


Out[15]:

In [16]:
grad = Gadfly.lab_gradient(color("white"),color("blue"))
function plotData(data,colname,dims,transform=identity)
   #dist = Normal(0.0, 1.0)
   template = set_unit_box(canvas(),dims)
   c = canvas(template)
   minv = transform(minimum(data[colname]))
   maxv = transform(maximum(data[colname]))
   for row in EachRow(data)
     v = transform(float64(isna(row[colname][1]) ? 0.0 : row[colname][1]))
     p = (v-minv)/(maxv-minv)
    c=compose(c,compose(canvas(template),ESRI2Compose(row["Shape"][1]),linewidth(0.05mm),stroke(color("black")),fill(grad(p))))
   end
  c
end


Out[16]:
plotData (generic function with 2 methods)

In [17]:
plotData(data2,"land_area",dims2,log10)


Out[17]:

Now it's your turn. Pick your favorite data set from the census website and ee what you can com up with. As one more example that you can modify, here is Population/Area:


In [17]:
get_file("http://www2.census.gov/prod2/statcomp/usac/excel/PST01.xls","PST01.xls");

In [18]:
PST = Taro.readxl("PST01.xls","Sheet4","A1:AP3199");
PST[1:5,1:10]


Out[18]:
AreanameSTCOUPST045200FPST045200DPST045200N1PST045200N2PST045201FPST045201DPST045201N1PST045201N2
1UNITED STATES000000.02.82171957e8000000000.02.85081556e800000000
2ALABAMA010000.04.451849e6000000000.04.464034e600000000
3Autauga, AL010010.043872.0000000000.044434.000000000
4Baldwin, AL010030.0141358.0000000000.0144988.000000000
5Barbour, AL010050.029035.0000000000.029223.000000000

In [19]:
popdata = PST[["STCOU","PST045200D"]]
rename!(popdata,"STCOU","FIPS")
rename!(popdata,"PST045200D","population")
popdata["FIPS"] = int(popdata["FIPS"])
data3 = join(join(counties2, popdata; on = "FIPS"), landdata; on="FIPS")
data3["pop/area"] = data3["population"]./data3["land_area"];
data3[1:10,:]


Out[19]:
NAMESTATE_NAMEFIPSShapepopulationland_areawater_areapop/area
1AutaugaAlabama1001Polygon(39 Float64 Points)43872.0604.458.4872.58168583009346
2BaldwinAlabama1003Polygon(69 Float64 Points)141358.02026.93430.5869.73995155234763
3BarbourAlabama1005Polygon(37 Float64 Points)29035.0904.5219.6132.09989828859506
4BibbAlabama1007Polygon(33 Float64 Points)19936.0626.163.1431.838507729653763
5BlountAlabama1009Polygon(55 Float64 Points)51181.0650.65.0278.6673839532739
6BullockAlabama1011Polygon(38 Float64 Points)11604.0626.061.0418.534964699869025
7ButlerAlabama1013Polygon(18 Float64 Points)21313.0777.921.0527.397418757712877
8CalhounAlabama1015Polygon(66 Float64 Points)111342.0612.323.86181.83629474784425
9ChambersAlabama1017Polygon(16 Float64 Points)36593.0603.115.9460.67384059292666
10CherokeeAlabama1019Polygon(34 Float64 Points)24053.0599.9546.8340.09167430619218

In [20]:
plotData(data3,"pop/area",dims2,log10)


Out[20]:

In [21]: