In [1]:
using DataFrames


INFO: Precompiling module DataFrames...

In [9]:
df = readtable("data.csv");

DataFrames

DataFrame Methods

There are various simple methods you can use to inspect a DataFrame


In [10]:
size(df)


Out[10]:
(250000,20)

In [11]:
names(df)


Out[11]:
20-element Array{Symbol,1}:
 :timestamp             
 :page_group            
 :geo_cc                
 :geo_rg                
 :geo_org               
 :geo_netspeed          
 :user_agent_family     
 :user_agent_major      
 :user_agent_minor      
 :user_agent_os         
 :user_agent_osversion  
 :user_agent_device_type
 :user_agent_model      
 :params_dom_sz         
 :params_dom_ln         
 :params_dom_script     
 :params_dom_img        
 :timers_t_done         
 :timers_t_resp         
 :timers_t_page         

Each column of a DataFrame is a DataArray

You can reference a column using the column name as a Symbol subscript. A DataArray is just a regular array that can contain NA, which is Juliaspeak for NULL.


In [12]:
df[:timers_t_done]


Out[12]:
250000-element DataArrays.DataArray{Int64,1}:
  6257
  5955
 14750
 12266
 10773
  6604
  3502
  6073
  6554
  6546
 12472
 10238
 10995
     ⋮
  7827
  6673
  6624
  6189
  4699
  2968
  7348
  7265
  8626
  3756
  3836
  4439

In [13]:
df[30:40, :timers_t_done]


Out[13]:
11-element DataArrays.DataArray{Int64,1}:
  2857
  3056
  5124
  3188
  4841
  4680
  4879
  6106
  4516
  5557
 12049

In [14]:
df[30:40, [:timestamp, :geo_cc, :geo_netspeed, :user_agent_family, :timers_t_done]]


Out[14]:
timestampgeo_ccgeo_netspeeduser_agent_familytimers_t_done
11455611221592PLNAIE2857
21455611283782PLNAIE3056
31455611355679PLNAIE5124
41455612940770GBCable/DSLSafari3188
51455613685100IEDialupIE4841
61455613730994IEDialupIE4680
71455614657272UANAFirefox4879
81455614335263UANAFirefox6106
91455614452250UANAFirefox4516
101455612605862GBCable/DSLSafari5557
111455613527347SANAChrome12049

Stats on DataFrames

Most Julia stats functions run on AbstractArray, which is the base type for Array as well as DataArray, so you can run them on any column of a DataFrame that contains numbers. You will probably need to remove NAs first using the dropna function.

Our test dataset doesn't contain any NA values for the timers_t_done column, so we're safe.


In [15]:
summarystats(df[:timers_t_done])


Out[15]:
Summary Stats:
Mean:         5858.974556
Minimum:      8.000000
1st Quartile: 2357.000000
Median:       3973.000000
3rd Quartile: 6688.000000
Maximum:      2536087.000000

Histograms

The hist function will by default split the dataset into equal sized buckets based on the data's range. This may not always be what you want, so you can pass in a list of thresholds as the second parameter.

The hist function returns a tuple. The first element is the thresholds used, which might be a Range object or an Array. The second element is the list of bucket frequencies.


In [16]:
hist(df[:timers_t_done])


Out[16]:
(0.0:200000.0:2.6e6,[249866,84,44,2,1,1,0,1,0,0,0,0,1])

Creating thresholds based on the data

We could use static thresholds, but that wouldn't adapt to different data sets. In this case, we develop a Julia function that determines thresholds based on the dataset.

Rather than divide the entire range into a fixed set of buckets, we divide the Inter-Quartile Range. This has the advantage of excluding outliers from the basic range. We then include outliers in their own buckets, one for the low bound and one for the high bound.

This is very similar to a box and whiskers plot.


In [17]:
# Function to set histogram thresholds after dropping outliers based on IQR
function getSymmetricThresholds(results::DataFrame; timer::Symbol=:timers_t_done)
    summary = summarystats(results[timer])
    fw  = (summary.q75-summary.q25)*1.5

    low = round(Int64, max(summary.min, summary.q25-fw))
    high = round(Int64, min(summary.max, summary.q75+fw))+1

    thresholds::Array{Int64, 1} = []

    nthresholds=25

    range = high - low

    for i in 0:nthresholds-1
        push!(thresholds, round(Int64, low + i * range/nthresholds))
    end

    push!(thresholds, high)
    if high < round(Int64, summary.max)
        push!(thresholds, round(Int64, summary.max))
    end

    return thresholds
end


Out[17]:
getSymmetricThresholds (generic function with 1 method)

Julia Functions

Notice that Julia functions are declared using the function keyword. Function parameters may have types attached to them, this is optional, and mainly useful when you overload function names.

Functions may have optional parameters, a ; separates required parameters from optional ones.

When passing optional parameters to a function, they need to be passed by name, and order doesn't matter.

A function typically only returns a single value, though that value may be a tuple of multiple objects. The caller can then receive the return value into a single tuple or multiple values enclosed in ().


In [18]:
thresholds = getSymmetricThresholds(df)


Out[18]:
27-element Array{Int64,1}:
       8
     535
    1062
    1589
    2116
    2643
    3170
    3698
    4225
    4752
    5279
    5806
    6333
       ⋮
    7914
    8441
    8968
    9495
   10023
   10550
   11077
   11604
   12131
   12658
   13185
 2536087

Running the hist function using our new thresholds gets us much better granularity into the data.


In [19]:
hist_global = hist(df[:timers_t_done], thresholds)[2]


Out[19]:
26-element Array{Int64,1}:
   252
  6337
 19357
 25199
 24620
 21891
 18662
 16302
 14786
 12989
 11284
  9803
  8566
  7237
  6349
  5424
  4757
  4204
  3728
  3096
  2669
  2429
  2098
  1729
  1529
 14702

Filtering DataFrames

We can also filter a DataFrame on the value of one or more fields. In the following example, we filter on all :geo_rg that are not NA and equal to US:: OR.


In [22]:
results_US = df[!isna(df[:geo_cc]) & (df[:geo_cc] .== "US"), :];

In [23]:
hist_US = hist(results_US[:timers_t_done], thresholds)[2]


Out[23]:
26-element Array{Int64,1}:
   249
  6243
 19010
 24561
 23830
 21014
 17859
 15578
 14051
 12363
 10758
  9309
  8103
  6823
  6006
  5101
  4462
  3903
  3483
  2896
  2473
  2265
  1926
  1584
  1419
 13249

Statistical Correlation

The cor function lets us run a correlation between the two histograms that we have


In [24]:
cor(hist_global, hist_US)


Out[24]:
0.9995864880638793

We could also run cumsum to generate the CDF from the histogram and correlate those values.


In [25]:
cor(cumsum(hist_global), cumsum(hist_US))


Out[25]:
0.9999733633808021

Splitting/Grouping a DataFrame

Use the by function to run an aggregation on a DataFrame grouped by one or more columns


In [26]:
by(df, :user_agent_family, rows -> median(rows[:timers_t_done]))


Out[26]:
user_agent_familyx1
1(Unknown)3740.0
2AOL4857.0
3Amazon Silk7599.0
4Android Browser11886.0
5BlackBerry WebKit8684.0
6Chrome3129.0
7Chrome Frame4067.0
8Chrome Mobile6776.0
9Chrome Mobile iOS4257.0
10Chromium2772.0
11Edge3187.0
12Firefox3412.0
13Firefox Alpha3747.0
14Firefox Beta10278.0
15Firefox Mobile8456.0
16Halebot4654.0
17IE2862.0
18IE Mobile6265.5
19Iron2738.0
20Maxthon3961.0
21Mobile Safari4147.0
22Nokia Services (WAP) Browser11402.0
23Opera12211.0
24Opera Coast6091.0
25Opera Mini3835.0
26Opera Mobile12235.0
27Other4855.5
28Pale Moon (Firefox Variant)4797.5
29PhantomJS1660.5
30Puffin922.0
&vellip&vellip&vellip

Problems if the aggregation function returns an array

If the aggregation function returns an array, like the hist function does, then we'll actually end up with one row per array element. Instead we need to serialize the array to a string or create a custom data type that encapsulates the array. The string method is easier albeit a little slower, but if we're going to export our data to JavaScript, we may need to do this anyway.


In [27]:
by(
    df,
    :user_agent_family, 
    rows -> DataFrame(
        count = size(rows, 1),
        median = median(rows[:timers_t_done]),
        hist = JSON.json(hist(rows[:timers_t_done], thresholds)[2])
    )
)


Out[27]:
user_agent_familycountmedianhist
1(Unknown)733740.0[0,1,8,9,6,5,7,5,0,2,1,2,4,0,0,3,1,5,3,1,2,2,0,1,0,5]
2AOL154857.0[0,0,1,1,1,1,1,0,2,2,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,2]
3Amazon Silk23237599.0[0,0,0,1,15,39,84,117,133,147,145,165,155,113,141,113,108,90,78,70,67,76,53,49,49,315]
4Android Browser175211886.0[0,1,0,0,0,3,17,29,43,50,66,63,76,65,80,46,64,55,52,48,44,41,52,46,32,779]
5BlackBerry WebKit438684.0[0,0,0,0,1,1,1,0,2,1,2,1,3,3,4,1,3,0,2,1,0,2,2,0,1,12]
6Chrome530863129.0[65,2116,6133,7217,6356,5033,3894,3216,2735,2225,1788,1558,1398,1164,993,872,711,606,563,464,437,342,322,275,246,2357]
7Chrome Frame374067.0[0,0,1,4,4,5,3,4,4,2,1,0,1,0,2,0,1,1,0,1,0,1,0,0,0,2]
8Chrome Mobile314776776.0[0,1,23,60,230,594,1106,1698,2232,2663,2709,2494,2268,2082,1793,1607,1356,1140,997,835,671,612,504,402,369,3031]
9Chrome Mobile iOS19874257.0[0,22,86,184,182,198,179,132,107,101,82,69,60,48,42,36,37,39,39,37,26,21,18,31,17,194]
10Chromium52772.0[0,0,0,0,2,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]
11Edge61503187.0[3,151,632,843,800,633,513,434,353,272,225,182,149,122,104,88,72,74,54,42,62,39,43,26,24,210]
12Firefox119843412.0[14,372,1051,1392,1398,1255,1085,832,803,575,548,383,322,267,220,156,148,115,134,99,74,82,64,38,52,505]
13Firefox Alpha13747.0[0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
14Firefox Beta910278.0[0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,1,3]
15Firefox Mobile2048456.0[0,0,0,0,2,4,5,8,8,9,12,13,11,13,6,10,9,10,13,7,6,10,8,7,3,30]
16Halebot104654.0[0,0,0,2,1,1,0,0,2,1,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,0]
17IE353602862.0[30,1305,4636,5553,4614,3543,2714,2141,1811,1395,1151,971,782,688,581,412,350,335,274,253,201,186,156,106,93,1079]
18IE Mobile2266265.5[0,0,0,0,0,2,6,10,15,18,34,30,14,15,10,6,13,5,7,2,4,4,1,2,1,27]
19Iron102738.0[0,0,0,1,3,3,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
20Maxthon63961.0[0,0,0,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0]
21Mobile Safari877754147.0[32,1018,4282,7212,8901,9018,7763,6621,5638,4906,4083,3472,2948,2340,2165,1890,1706,1601,1398,1157,1018,937,805,704,597,5562]
22Nokia Services (WAP) Browser111402.0[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
23Opera12312211.0[0,0,1,2,2,4,9,6,3,5,4,4,4,1,4,1,4,1,2,0,1,0,3,2,1,59]
24Opera Coast36091.0[0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0]
25Opera Mini73835.0[0,0,0,1,1,0,1,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
26Opera Mobile912235.0[0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,2,0,3]
27Other864855.5[0,0,0,3,3,5,7,8,15,14,4,4,4,1,1,1,1,1,1,1,1,1,1,1,0,8]
28Pale Moon (Firefox Variant)184797.5[0,0,0,1,1,2,0,4,1,1,2,4,0,1,0,0,0,1,0,0,0,0,0,0,0,0]
29PhantomJS781660.5[0,7,27,26,10,7,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
30Puffin5922.0[0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
&vellip&vellip&vellip&vellip&vellip

Copy the JSON to a JavaScript file when first testing D3 code

It's easier to start your D3 experimentation with a standalone file rather than within the IJulia interface. A simpler dev setup is easier to debug.


In [28]:
println("Histogram:\n", JSON.json(hist_global))
println()
println("Thresholds:\n", JSON.json(thresholds))


Histogram:
[252,6337,19357,25199,24620,21891,18662,16302,14786,12989,11284,9803,8566,7237,6349,5424,4757,4204,3728,3096,2669,2429,2098,1729,1529,14702]

Thresholds:
[8,535,1062,1589,2116,2643,3170,3698,4225,4752,5279,5806,6333,6860,7387,7914,8441,8968,9495,10023,10550,11077,11604,12131,12658,13185,2536087]

In [ ]: