Loading data into Julia

Initial setup


In [1]:
#Update installed packages, then install JSON and PyPlot
Pkg.update()
Pkg.add("JSON")
Pkg.add("PyPlot")


INFO: Updating METADATA...
INFO: Updating JSON...
INFO: Computing changes...
INFO: No packages to install, update or remove.
INFO: Nothing to be done.
INFO: Nothing to be done.

Read data from JSON files (can be done in parallel)

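This code reads every tweet from the monthly JSON files, extracts its geotag when one is present, and keeps a count of tweets per distinct coordinate pair in the dictionary locations. Each file is handed to a separate worker process, and the per-file counts are combined at the end.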
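
The tallying idea can be sketched on its own, away from the file handling. This is a minimal illustration written for Julia 1.x (the cells in this notebook use older syntax); sample_points and tally are made-up names and the coordinates are invented.

```julia
# Hypothetical sample data: each entry is a [latitude, longitude] pair.
sample_points = [[42.36, -71.06], [42.37, -71.11], [42.36, -71.06]]

# Map each distinct coordinate pair to the number of times it was seen.
tally = Dict{Vector{Float64},Int}()
for point in sample_points
    # get returns the stored count, or 0 if the key has not been seen yet
    tally[point] = get(tally, point, 0) + 1
end

println(tally[[42.36, -71.06]])  # count for the repeated point
```

Arrays are hashed by their contents, so two tweets with the same coordinates land on the same dictionary key.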


In [2]:
addprocs(9-nprocs()) #Bring the total to 9 processes: the master plus one worker for each of the 8 files


Out[2]:
8-element Array{Any,1}:
 2
 3
 4
 5
 6
 7
 8
 9

In [3]:
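@everywhere begin
using JSON
#Let haskey return false when the container is nothing (a JSON null)
import Base.haskey; haskey(a::Nothing, b::ASCIIString) = false
#Add count to dict[key], creating the entry if the key is new
increment!{S,T<:Integer}(dict::Dict{S,T}, key::S, count::T=1)=if haskey(dict, key) dict[key] += count else dict[key] = count end
#Tally the geotags found in one file of concatenated JSON tweets
function parsefile(filename)
    locations = Dict{Array{Any,1},Int}()
    datafile = open(filename)
    n=0 #Number of geotagged tweets found
    @time while true
        thisentry = nothing
        try
            thisentry = JSON.parse(datafile) #Read the next JSON value from the stream
        catch parse_error
            #An EOFError at the end of the file means all records have been read
            isa(parse_error, EOFError) && eof(datafile) ? break : throw(parse_error)
        end
        try
            thiscoordinates = thisentry["geo"]["coordinates"]
            n += 1
            #println(n, " ", thisentry["id"]) #DEBUG: Print message id
            increment!(locations, thiscoordinates)
        catch key_error
            #Skip tweets whose geotag is missing (KeyError) or unusable, e.g. null (MethodError)
            isa(key_error, KeyError) || isa(key_error, MethodError) ? continue : throw(key_error)
        end
    end
    println("Loaded ", n, " records from ", datafile.name)
    close(datafile)
    locations
end
end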

filenames = [@sprintf("/data/2012_%02d.json", month) for month=4:11]
locations_all = pmap(parsefile, filenames)

#Collate data from individual files
locations = Dict{Array{Any,1},Int}()
for location_data in locations_all, (coord, count) in location_data
    increment!(locations, coord, count)
end
println("Loaded ", length(locations), " distinct locations")


	From worker 2:	elapsed time: 67.717163116 seconds (6741342276 bytes allocated)
	From worker 2:	Loaded 60131 records from <file /data/2012_04.json>
	From worker 5:	elapsed time: 226.860380545 seconds (31520871140 bytes allocated)
	From worker 5:	Loaded 284912 records from <file /data/2012_07.json>
	From worker 3:	elapsed time: 259.525999493 seconds (25977734952 bytes allocated)
	From worker 3:	Loaded 235374 records from <file /data/2012_05.json>
	From worker 4:	elapsed time: 305.10017051 seconds (27553569072 bytes allocated)
	From worker 4:	Loaded 249783 records from <file /data/2012_06.json>
	From worker 7:	elapsed time: 307.071823714 seconds (40099106452 bytes allocated)
	From worker 7:	Loaded 377012 records from <file /data/2012_09.json>
	From worker 6:	elapsed time: 369.16577103 seconds (35117969924 bytes allocated)
	From worker 6:	Loaded 324518 records from <file /data/2012_08.json>
	From worker 8:	elapsed time: 414.500404976 seconds (47325831500 bytes allocated)
	From worker 8:	Loaded 446216 records from <file /data/2012_10.json>
	From worker 9:	elapsed time: 436.93057881 seconds (51495989732 bytes allocated)
	From worker 9:	Loaded 488078 records from <file /data/2012_11.json>
Loaded 1721888 distinct locations
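
The collation step, which adds up the per-file dictionaries, can also be written with merge and a combining function. The following is a sketch for Julia 1.x, not the code that produced the output above; april and may are hypothetical stand-ins for two of the dictionaries returned by pmap, with invented counts.

```julia
# Hypothetical per-file tallies, standing in for two results of pmap(parsefile, filenames).
april = Dict([42.36, -71.06] => 3, [42.37, -71.11] => 1)
may   = Dict([42.36, -71.06] => 2, [42.40, -71.00] => 5)

# merge(+, ...) adds the counts of keys that appear in more than one dictionary
# and copies the remaining keys unchanged.
combined = merge(+, april, may)

println("Loaded ", length(combined), " distinct locations")
```

With a vector of dictionaries, such as the result of pmap, the same call would presumably be written as merge(+, locations_all...).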

Analysis 1: Simple mapping

In this analysis, we plot all the locations with at least 50 tweets in the data set.


In [9]:
#Convert the tallies into parallel arrays for PyPlot, keeping only locations with at least 50 tweets
latitudes  = Float64[];
longitudes = Float64[];
frequencies= Int[];
for (point, count) in locations
    if count >= 50
        push!(latitudes, point[1])
        push!(longitudes, point[2])
        push!(frequencies, count)
    end
end
length(latitudes)


Out[9]:
1881
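
The same threshold filter can be expressed with comprehensions instead of an explicit loop. This is an alternative sketch for Julia 1.x; sample_locations, threshold and busy are hypothetical names and the counts are invented.

```julia
# Hypothetical tallies keyed by [latitude, longitude].
sample_locations = Dict([42.36, -71.06] => 120,
                        [42.37, -71.11] => 7,
                        [42.40, -71.00] => 50)
threshold = 50

# Keep only the (point, count) pairs that reach the threshold.
busy = [(point, count) for (point, count) in sample_locations if count >= threshold]

# Unpack into three parallel arrays, one entry per retained location.
latitudes   = Float64[point[1] for (point, _) in busy]
longitudes  = Float64[point[2] for (point, _) in busy]
frequencies = Int[count for (_, count) in busy]

println(length(latitudes))  # number of locations that passed the filter
```

Building busy first keeps the three arrays aligned, since they are all derived from the same ordered list.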

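This code uses Python's matplotlib and its Basemap toolkit, called through PyCall, to plot the points on a map of Eastern Massachusetts. Each marker's size is scaled with the square root of the tweet count, so the area of each dot is proportional to the number of tweets from that point.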


In [25]:
using PyPlot
using PyCall
@pyimport mpl_toolkits.basemap as basemap
#Scale marker sizes so that the busiest location gets a marker of size 30
scalefactor=30/sqrt(maximum(frequencies))
#Mercator map bounded by the given lower-left and upper-right corners
m=basemap.Basemap(projection="merc", resolution="h",
    llcrnrlat=42.18,llcrnrlon=-71.3,
    urcrnrlat=42.54,urcrnrlon=-70.825)
m[:drawmapboundary](fill_color="#4771a5")
m[:fillcontinents](color="#555555")
m[:drawcoastlines]()
#Draw one red circle per location
for i=1:length(longitudes)
    m[:plot](longitudes[i], latitudes[i], "ro",
        markersize=sqrt(frequencies[i])*scalefactor,latlon=true)
end
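
The marker sizing used in the plot can be checked in isolation. The sketch below, written for Julia 1.x, assumes that markersize is a linear dimension of the marker, so that its area grows with the square of the size; sample_frequencies and maxsize are hypothetical names and the counts are invented.

```julia
# Hypothetical tweet counts for three locations.
sample_frequencies = [50, 200, 800]
maxsize = 30  # marker size given to the busiest location

# Size grows with the square root of the count, so area grows linearly with it.
scalefactor = maxsize / sqrt(maximum(sample_frequencies))
markersizes = sqrt.(sample_frequencies) .* scalefactor

# Area is proportional to size squared; area ratios should match count ratios.
areas = markersizes .^ 2
println(areas ./ areas[1])
println(sample_frequencies ./ sample_frequencies[1])
```

The two printed vectors should agree up to floating-point rounding, which is the sense in which dot area is proportional to tweet count.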