Let's look at some real-life datasets and see how we would analyze them using Julia. For this particular notebook, I will be working with data collected about US counties by various government agencies, but the general techniques we will see apply equally well to any other dataset. Here's the list of packages I'll be using (don't worry if you don't know what they do; I'll introduce them later):
In [22]:
using Shapefile
using DataArrays
using DataFrames
using Color
using SIUnits
using Requests
In [2]:
using Compose
using Gadfly
In [3]:
using Taro
Taro.init()
In [3]:
Compose.set_default_graphic_size(10inch,10inch/golden);
set_default_plot_size(10inch,10inch/golden);
Before we can get started, we need to get ourselves some data. I will mostly be working with the datasets collected by the Census Bureau here. But before we get any data from there, let's start with something simple. I have prepared a CSV file with the names and states of all the counties in the U.S. here. All the data from the Census will be coded by FIPS (Federal Information Processing Standard) code. This data file will help us map from FIPS code to county name.
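As a quick aside on how FIPS codes are structured (this little helper is just an illustration, not part of the data preparation below): a county FIPS code is the two-digit state code followed by the three-digit county code, and we'll store it as a plain integer, so any leading zero is dropped.
county_fips(state, county) = state * 1000 + county   # illustration only, not used below
county_fips(6, 37)   # California (06) + Los Angeles County (037) => 6037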
In [4]:
# Download a file only if we don't already have a local copy
get_file(URL,file) = isfile(file) || download(URL,file)
Out[4]:
In [5]:
get_file("https://gist.github.com/loladiro/8573181/raw/34894ec1a3cae030490b83be3a36b2166852e000/gistfile1.txt","USCounties.csv")
counties = readtable("USCounties.csv");
# Here are the first 10 entries so you can get a feel for what the data looks like
counties[1:10,:]
Out[5]:
In [5]:
# STATE_FIPS and CNTY_FIPS are really redundant, so let's get rid of them
delete!(counties,"STATE_FIPS")
delete!(counties,"CNTY_FIPS");
Next, let's get information about the land mass of each county.
Going back to the Census website, we find just the right data at http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls.
If you're wondering what the column names mean, there's a comprehensive overview at http://www2.census.gov/prod2/statcomp/usac/excel/Mastdata.xls
In [5]:
get_file("http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls","LND01.xls");
We will be using the Taro package to read the Excel file like so:
In [6]:
LND = Taro.readxl("LND01.xls","Sheet1","A1:AH3199");
LND[1:5,1:10] # Again, a small sneak preview
Out[6]:
In [7]:
# Let's get rid of some of the data we don't need and rename the columns to be more human-friendly
# We're interested in the FIPS code, the total land area and the total water area
landdata = LND[["STCOU","LND010200D","LND210200D"]];
# Rename the columns
rename!(landdata,"STCOU","FIPS")
rename!(landdata,"LND010200D","land_area")
rename!(landdata,"LND210200D","water_area")
# Let's also get rid of counties for which there's no data:
landdata = landdata[find(x->x!=0,landdata["land_area"]),:]
landdata = landdata[find(x->x!=0,landdata["water_area"]),:]
# Let's also make FIPS an integer
landdata["FIPS"] = int(landdata["FIPS"]);
landdata[1:5,:]
Out[7]:
In [8]:
# Let's combine it with the CSV we originally loaded (by FIPS code)
data = join(counties, landdata; on="FIPS")
data[1:5,:]
Out[8]:
In [9]:
set_default_plot_size(10inch,10inch/golden);
plot(data, x="land_area", y="water_area")
# Try also: color="STATE_NAME", Scale.y_log10, Scale.x_log10
Out[9]:
In [10]:
plot(data, x="land_area", Geom.histogram, color="STATE_NAME", Scale.x_log10)
Out[10]:
Since we're dealing with geographical data, it would be great to plot it on a map. I'll use this as an example of how one might build a custom visualization in Julia using the Compose package. So let's get started.
Since I want to plot US counties, we first need to know how to draw them. For that purpose, I googled "US counties shapefile" and downloaded the USCounties.zip file from here (the first hit on Google). To follow along, download it and extract it into the current directory.
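If you'd rather script the extraction step, something along these lines should do it (assuming you've already saved USCounties.zip into the current directory and have the command-line unzip tool available):
run(`unzip -o USCounties.zip`)   # extracts USCounties.shp (and friends) into the current directory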
The .shp (shapefile) format is a very simple file format for defining shapes and is commonly used in mapping. I wrote a simple package to read this file format: https://github.com/loladiro/Shapefile.jl. It's about 130 lines of code. There's no need to really understand it, but do have a look at the source if you're curious.
In [10]:
shp = open("USCounties.shp") do fd
    read(fd,Shapefile.Handle)
end;
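To get a rough feel for what we just loaded, we can peek at the handle; the two fields inspected below are the only ones we'll rely on later (one shape per county, plus the overall bounding rectangle):
length(shp.shapes)   # number of shapes, one per county
shp.MBR              # minimum bounding rectangle enclosing all shapes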
At this point we have the shapefile loaded. By loading the file in an external viewer, I determined that the shapes are in the same order as the original CSV; what a coincidence ;). This information is embedded in the shapefile, but my package doesn't read it out yet, though this would be a fun project to add!
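We can't verify the ordering from within Julia yet, but as a minimal sanity check the number of shapes should at least match the number of rows in our CSV:
@assert length(shp.shapes) == size(counties, 1)   # one shape per county row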
In [10]:
# Since the order is the same we can just add it as an extra column
counties["Shape"] = shp.shapes;
In [11]:
data = join(counties, landdata; on="FIPS");
data[1:5,:]
Out[11]:
In [12]:
# Going from one kind of polygon to another
ESRI2Compose(poly::Shapefile.Polygon) =
Compose.FormTree(Compose.Polygon([Compose.Point(point.x,point.y) for point in poly.points]))
Out[12]:
In [13]:
# Drawing counties
function generateCountyOutline(data,color,dims)
    template = set_unit_box(canvas(),dims)
    c = canvas(template)
    for row in EachRow(data)
        c = compose(c,compose(canvas(template),ESRI2Compose(row["Shape"][1]),
                    linewidth(0.05mm),stroke(color),fill(nothing)))
    end
    c
end
dims = UnitBox(shp.MBR.left,shp.MBR.bottom,(shp.MBR.right-shp.MBR.left),-(shp.MBR.bottom-shp.MBR.top))
generateCountyOutline(data,color) = generateCountyOutline(data,color,dims)
Out[13]:
In [14]:
generateCountyOutline(data,color("black"))
Out[14]:
In [14]:
# OK, that was a little small; let's get rid of Hawaii and Alaska
counties2 = copy(counties)
DataFrames.deleterows!(counties2,find((counties2["STATE_NAME"].=="Alaska")|(counties2["STATE_NAME"].=="Hawaii")))
# We need to recalculate the bounding rectangle
minx = minimum(map(shape->mapreduce(p->p.x,min,shape.points),counties2["Shape"]))
maxx = maximum(map(shape->mapreduce(p->p.x,max,shape.points),counties2["Shape"]))
miny = minimum(map(shape->mapreduce(p->p.y,min,shape.points),counties2["Shape"]))
maxy = maximum(map(shape->mapreduce(p->p.y,max,shape.points),counties2["Shape"]))
dims2 = UnitBox(minx,maxy,(maxx-minx),-(maxy-miny));
In [15]:
data2 = join(counties2, landdata; on=:FIPS)
generateCountyOutline(data2,color("black"),dims2)
Out[15]:
In [16]:
grad = Gadfly.lab_gradient(color("white"),color("blue"))
function plotData(data,colname,dims,transform=identity)
    template = set_unit_box(canvas(),dims)
    c = canvas(template)
    minv = transform(minimum(data[colname]))
    maxv = transform(maximum(data[colname]))
    for row in EachRow(data)
        v = transform(float64(isna(row[colname][1]) ? 0.0 : row[colname][1]))
        p = (v-minv)/(maxv-minv)
        c = compose(c,compose(canvas(template),ESRI2Compose(row["Shape"][1]),
                    linewidth(0.05mm),stroke(color("black")),fill(grad(p))))
    end
    c
end
Out[16]:
In [17]:
plotData(data2,"land_area",dims2,log10)
Out[17]:
Now it's your turn. Pick your favorite dataset from the Census website and see what you can come up with. As one more example that you can modify, here is Population/Area:
In [17]:
get_file("http://www2.census.gov/prod2/statcomp/usac/excel/PST01.xls","PST01.xls");
In [18]:
PST = Taro.readxl("PST01.xls","Sheet4","A1:AP3199");
PST[1:5,1:10]
Out[18]:
In [19]:
popdata = PST[["STCOU","PST045200D"]]
rename!(popdata,"STCOU","FIPS")
rename!(popdata,"PST045200D","population")
popdata["FIPS"] = int(popdata["FIPS"])
data3 = join(join(counties2, popdata; on = "FIPS"), landdata; on="FIPS")
data3["pop/area"] = data3["population"]./data3["land_area"];
data3[1:10,:]
Out[19]:
In [20]:
plotData(data3,"pop/area",dims2,log10)
Out[20]:
In [21]: