Let's look at some real-life datasets and see how we would analyze them using Julia. For this particular notebook, I will be working with data collected about US counties by various government agencies, but the general techniques we will see apply equally well to any other dataset. Here's the list of packages I'll be using (don't worry if you don't know what they do; I'll introduce them later):
In [22]:
using Shapefile
using DataArrays
using DataFrames
using Color
using SIUnits
using Requests
In [2]:
using Compose
using Gadfly
In [3]:
using Taro
Taro.init()
In [3]:
Compose.set_default_graphic_size(10inch,10inch/golden);
set_default_plot_size(10inch,10inch/golden);
Before we can get started, we need to get ourselves some data. I will mostly be working with the datasets collected by the Census Bureau here. But before we get any data from there, let's start with something simple. I have prepared a CSV file with the names and states of all the counties in the U.S. here. All the data from the Census will be coded by FIPS (Federal Information Processing Standard) code. This data file will help us map from FIPS code to county name.
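As a quick aside on how FIPS codes are structured (this little helper is just an illustration, not part of the data preparation below): a county FIPS code is the two-digit state code followed by the three-digit county code, and we'll store it as a plain integer, so any leading zero is dropped.
county_fips(state, county) = state * 1000 + county   # illustration only, not used below
county_fips(6, 37)   # California (06) + Los Angeles County (037) => 6037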
In [4]:
# Download a file only if we don't already have a local copy
get_file(URL,file) = isfile(file) || download(URL,file)
Out[4]:
In [5]:
get_file("https://gist.github.com/loladiro/8573181/raw/34894ec1a3cae030490b83be3a36b2166852e000/gistfile1.txt","USCounties.csv")
counties = readtable("USCounties.csv");
# Here are the first 10 entries so you can get a feel for what the data looks like
counties[1:10,:]
Out[5]:
In [5]:
# STATE_FIPS and CNTY_FIPS are really redundant, so let's get rid of them
delete!(counties,"STATE_FIPS")
delete!(counties,"CNTY_FIPS");
Next, let's get information about the land mass of each county.
Going back to the Census website, we find just the right data at http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls.
If you're wondering what the column names mean, there's a comprehensive overview at http://www2.census.gov/prod2/statcomp/usac/excel/Mastdata.xls
In [5]:
get_file("http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls","LND01.xls");
We will be using the Taro package to read the Excel file like so:
In [6]:
LND = Taro.readxl("LND01.xls","Sheet1","A1:AH3199");
LND[1:5,1:10] # Again, a small sneak preview
Out[6]:
In [7]:
# Let's get rid of some of the data we don't need and rename the columns to be more human-friendly
# We're interested in the FIPS code, the total land area and the total water area
landdata = LND[["STCOU","LND010200D","LND210200D"]];
# Rename the columns
rename!(landdata,"STCOU","FIPS")
rename!(landdata,"LND010200D","land_area")
rename!(landdata,"LND210200D","water_area")
# Let's also get rid of counties for which there's no data:
landdata = landdata[find(x->x!=0,landdata["land_area"]),:]
landdata = landdata[find(x->x!=0,landdata["water_area"]),:]
# Let's also make FIPS an integer
landdata["FIPS"] = int(landdata["FIPS"]);
landdata[1:5,:]
Out[7]:
In [8]:
# Let's combine it with the CSV we originally loaded (by FIPS code)
data = join(counties, landdata; on="FIPS")
data[1:5,:]
Out[8]:
In [9]:
set_default_plot_size(10inch,10inch/golden);
plot(data, x="land_area", y="water_area")
# Try also: color="STATE_NAME", Scale.y_log10, Scale.x_log10
Out[9]:
In [10]:
plot(data, x="land_area", Geom.histogram, color="STATE_NAME", Scale.x_log10)
Out[10]:
Since we're dealing with geographical data, it would be great to plot it on a map. I'll use this as an example of how one might build a custom visualization in Julia using the Compose package. So let's get started.
Since I want to plot US counties, we first need to know how to draw them. For that purpose, I googled "US counties shapefile" and downloaded the USCounties.zip file from here (the first hit on Google). To follow along, download it and extract it into the current directory.
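If you'd rather script the extraction step, something along these lines should do it (assuming you've already saved USCounties.zip into the current directory and have the command-line unzip tool available):
run(`unzip -o USCounties.zip`)   # extracts USCounties.shp (and friends) into the current directory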
The .shp (shapefile) format is a very simple file format for defining shapes and is commonly used in mapping. I wrote a simple package to read this file format: https://github.com/loladiro/Shapefile.jl. It's about 130 lines of code. There's no need to really understand it, but do have a look at the source if you're curious.
In [10]:
shp = open("USCounties.shp") do fd
    read(fd,Shapefile.Handle)
end;
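To get a rough feel for what we just loaded, we can peek at the handle; the two fields inspected below are the only ones we'll rely on later (one shape per county, plus the overall bounding rectangle):
length(shp.shapes)   # number of shapes, one per county
shp.MBR              # minimum bounding rectangle enclosing all shapes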
At this point we have the shapefile loaded. By loading the file in an external viewer, I determined that the shapes are in the same order as the original CSV; what a coincidence ;). This information is embedded in the shapefile, but my package doesn't read it out yet, though this would be a fun project to add!
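We can't verify the ordering from within Julia yet, but as a minimal sanity check the number of shapes should at least match the number of rows in our CSV:
@assert length(shp.shapes) == size(counties, 1)   # one shape per county row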
In [10]:
# Since the order is the same we can just add it as an extra column
counties["Shape"] = shp.shapes;
In [11]:
data = join(counties, landdata; on="FIPS");
data[1:5,:]
Out[11]:
In [12]:
# Going from one kind of polygon to another
ESRI2Compose(poly::Shapefile.Polygon) =
Compose.FormTree(Compose.Polygon([Compose.Point(point.x,point.y) for point in poly.points]))
Out[12]:
In [13]:
# Drawing counties
function generateCountyOutline(data,color,dims)
    template = set_unit_box(canvas(),dims)
    c = canvas(template)
    for row in EachRow(data)
        c = compose(c,compose(canvas(template),ESRI2Compose(row["Shape"][1]),
                    linewidth(0.05mm),stroke(color),fill(nothing)))
    end
    c
end
dims = UnitBox(shp.MBR.left,shp.MBR.bottom,(shp.MBR.right-shp.MBR.left),-(shp.MBR.bottom-shp.MBR.top))
generateCountyOutline(data,color) = generateCountyOutline(data,color,dims)
Out[13]:
In [14]:
generateCountyOutline(data,color("black"))
Out[14]:
In [14]:
# OK, that was a little small; let's get rid of Hawaii and Alaska
counties2 = copy(counties)
DataFrames.deleterows!(counties2,find((counties2["STATE_NAME"].=="Alaska")|(counties2["STATE_NAME"].=="Hawaii")))
# We need to recalculate the bounding rectangle
minx = minimum(map(shape->mapreduce(p->p.x,min,shape.points),counties2["Shape"]))
maxx = maximum(map(shape->mapreduce(p->p.x,max,shape.points),counties2["Shape"]))
miny = minimum(map(shape->mapreduce(p->p.y,min,shape.points),counties2["Shape"]))
maxy = maximum(map(shape->mapreduce(p->p.y,max,shape.points),counties2["Shape"]))
dims2 = UnitBox(minx,maxy,(maxx-minx),-(maxy-miny));
In [15]:
data2 = join(counties2, landdata; on=:FIPS)
generateCountyOutline(data2,color("black"),dims2)
Out[15]:
In [16]:
grad = Gadfly.lab_gradient(color("white"),color("blue"))
function plotData(data,colname,dims,transform=identity)
    template = set_unit_box(canvas(),dims)
    c = canvas(template)
    minv = transform(minimum(data[colname]))
    maxv = transform(maximum(data[colname]))
    for row in EachRow(data)
        v = transform(float64(isna(row[colname][1]) ? 0.0 : row[colname][1]))
        p = (v-minv)/(maxv-minv)
        c = compose(c,compose(canvas(template),ESRI2Compose(row["Shape"][1]),
                    linewidth(0.05mm),stroke(color("black")),fill(grad(p))))
    end
    c
end
Out[16]:
In [17]:
plotData(data2,"land_area",dims2,log10)
Out[17]:
Now it's your turn. Pick your favorite dataset from the Census website and see what you can come up with. As one more example that you can modify, here is Population/Area:
In [17]:
get_file("http://www2.census.gov/prod2/statcomp/usac/excel/PST01.xls","PST01.xls");
In [18]:
PST = Taro.readxl("PST01.xls","Sheet4","A1:AP3199");
PST[1:5,1:10]
Out[18]:
In [19]:
popdata = PST[["STCOU","PST045200D"]]
rename!(popdata,"STCOU","FIPS")
rename!(popdata,"PST045200D","population")
popdata["FIPS"] = int(popdata["FIPS"])
data3 = join(join(counties2, popdata; on = "FIPS"), landdata; on="FIPS")
data3["pop/area"] = data3["population"]./data3["land_area"];
data3[1:10,:]
Out[19]:
In [20]:
plotData(data3,"pop/area",dims2,log10)
Out[20]:
In [21]: