Introduction to DataFrames in Julia

By Tyler Ransom, Duke University Social Science Research Institute

tyler.ransom@duke.edu

Julia's DataFrames package is largely modeled on R's data.frame object. The key underlying principle is that data frames can store mixed data types (e.g. strings and numbers) in the same object. Data frames also support a missing-data value, which is NA in Julia. Other statistical software packages such as SAS, Stata, SPSS, and Matlab offer similar features in their data storage.

This tutorial familiarizes Julia users with the primary syntax and capabilities of Julia's DataFrames package. There will be emphasis on making connections with Stata's syntax and features, but anyone with experience in statistical programming will be able to make connections to their preferred language.

First, let's call the packages we'll need for this demonstration. We'll be using Julia version 0.4.1 with DataFrames version 0.6.10 and FreqTables version 0.0.1.



In [ ]:

    
using DataFrames
using FreqTables

1. Reading in data, summarizing data structure, and browsing

Reading in a delimited text file

Now let's read in some sample data --- the auto dataset from Stata (in CSV form: https://github.com/jmxpearson/duke-julia-ssri-2016/auto.csv). In Julia, the readtable() function converts delimited text files into data frames.

There are a number of options for configuring the read-in operation, but for now we'll use a simple comma-separated file with standard configurations.

Notice that whatever variable name you choose on the left-hand side of the equals sign will be the name of your data frame moving forward.



In [2]:

    
auto = readtable("auto.csv");
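On more recent Julia releases, readtable() is no longer part of DataFrames, and delimited files are usually read with the separate CSV.jl package. The following is a minimal sketch under that assumption (CSV.jl and DataFrames.jl 1.x, not the versions used in this tutorial). The three-row excerpt is taken from the rows displayed later in this tutorial, and it is assumed here that missing values are written as "NA" in the file.

```julia
using CSV, DataFrames

# Small excerpt standing in for auto.csv; "NA" is assumed to mark a missing value
csv_text = """
make,price,mpg,rep78
AMC Concord,4099,22,3
AMC Pacer,4749,17,3
AMC Spirit,3799,22,NA
"""

# Parse the text into a data frame; missingstring says what to read as missing
auto_small = CSV.read(IOBuffer(csv_text), DataFrame; missingstring = "NA")

size(auto_small)        # (rows, columns)
describe(auto_small)    # per-column summary, including the number of missing values
```

To read from disk instead, the IOBuffer would be replaced by the file path.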

Summary of data structure

Next, let's look at the variables that are in our data frame. The showcols function accomplishes this task. This is very similar to Stata's describe command.



In [3]:

    
showcols(auto)









    



74x12 DataFrames.DataFrame
| Col # | Name         | Eltype     | Missing |
|-------|--------------|------------|---------|
| 1     | make         | UTF8String | 0       |
| 2     | price        | Int64      | 0       |
| 3     | mpg          | Int64      | 0       |
| 4     | rep78        | Int64      | 5       |
| 5     | headroom     | Float64    | 0       |
| 6     | trunk        | Int64      | 0       |
| 7     | weight       | Int64      | 0       |
| 8     | length       | Int64      | 0       |
| 9     | turn         | Int64      | 0       |
| 10    | displacement | Int64      | 0       |
| 11    | gear_ratio   | Float64    | 0       |
| 12    | foreign      | Int64      | 0       |

The output of showcols() shows us that we have 74 observations and 12 variables, and lists the name and element type of each variable along with its number of missing observations.

We can also get the length and width of our data frame using the size() function:



In [112]:

    
num_obs  = size(auto,1)









    Out[112]:





74



In [113]:

    
num_vars = size(auto,2)









    Out[113]:





12

Browsing your data

Next, let's look at some of our variables. We do this by either referencing the name with a ":" in front, or with the column number:



In [114]:

    
auto[:price]









    Out[114]:





74-element DataArrays.DataArray{Int64,1}:
  4099
  4749
  3799
  4816
  7827
  5788
  4453
  5189
 10372
  4082
 11385
 14500
 15906
     ⋮
  3995
 12990
  3895
  3798
  5899
  3748
  5719
  7140
  5397
  4697
  6850
 11995



In [115]:

    
auto[2]









    Out[115]:





74-element DataArrays.DataArray{Int64,1}:
  4099
  4749
  3799
  4816
  7827
  5788
  4453
  5189
 10372
  4082
 11385
 14500
 15906
     ⋮
  3995
 12990
  3895
  3798
  5899
  3748
  5719
  7140
  5397
  4697
  6850
 11995



In [116]:

    
auto[:,[:price,:mpg]]









    Out[116]:




price mpg
1 4099 22
2 4749 17
3 3799 22
4 4816 20
5 7827 15
6 5788 18
7 4453 26
8 5189 20
9 10372 16
10 4082 19
11 11385 14
12 14500 14
13 15906 21
14 3299 29
15 5705 16
16 4504 22
17 5104 22
18 3667 24
19 3955 19
20 3984 30
21 4010 18
22 5886 16
23 6342 17
24 4389 28
25 4187 21
26 11497 12
27 13594 12
28 13466 14
29 3829 22
30 5379 14
⋮ ⋮ ⋮

We can also use the head() and tail() functions to view the first k and last k observations for all variables in our data frame:



In [117]:

    
head(auto,4)









    Out[117]:




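| Row | make          | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|-----|---------------|-------|-----|-------|----------|-------|--------|--------|------|--------------|------------|---------|
| 1   | AMC Concord   | 4099  | 22  | 3     | 2.5      | 11    | 2930   | 186    | 40   | 121          | 3.58       | 0       |
| 2   | AMC Pacer     | 4749  | 17  | 3     | 3.0      | 11    | 3350   | 173    | 40   | 258          | 2.53       | 0       |
| 3   | AMC Spirit    | 3799  | 22  | NA    | 3.0      | 12    | 2640   | 168    | 35   | 121          | 3.08       | 0       |
| 4   | Buick Century | 4816  | 20  | 3     | 4.5      | 16    | 3250   | 196    | 40   | 196          | 2.93       | 0       |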



In [118]:

    
tail(auto,4)









    Out[118]:




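| Row | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|-----|-------------|-------|-----|-------|----------|-------|--------|--------|------|--------------|------------|---------|
| 1   | VW Diesel   | 5397  | 41  | 5     | 3.0      | 15    | 2040   | 155    | 35   | 90           | 3.78       | 1       |
| 2   | VW Rabbit   | 4697  | 25  | 4     | 3.0      | 15    | 1930   | 155    | 35   | 89           | 3.78       | 1       |
| 3   | VW Scirocco | 6850  | 25  | 4     | 2.0      | 16    | 1990   | 156    | 36   | 97           | 3.78       | 1       |
| 4   | Volvo 260   | 11995 | 17  | 5     | 2.5      | 14    | 3170   | 193    | 37   | 163          | 2.98       | 1       |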

We can also list observations of certain variables indexed by their row number:



In [119]:

    
auto[[1;2;4;15],[:headroom,:trunk]]









    Out[119]:




headroom trunk
1 2.5 11
2 3.0 11
3 4.5 16
4 4.0 20

We can also list observations that meet some condition. For example, suppose we want to look at the headroom and trunk space for all cars that achieve less than 20 miles per gallon:



In [120]:

    
auto[(auto[:,:mpg].<20),[:headroom,:trunk]]









    Out[120]:




headroom trunk
1 3.0 11
2 4.0 20
3 4.0 21
4 3.5 17
5 3.5 13
6 4.0 20
7 3.5 16
8 4.0 20
9 3.5 13
10 4.0 17
11 4.0 17
12 4.5 21
13 3.5 22
14 2.5 18
15 3.5 15
16 3.5 16
17 3.5 23
18 3.0 15
19 3.0 16
20 2.0 16
21 4.5 16
22 4.0 20
23 4.5 14
24 3.5 17
25 5.0 16
26 4.0 20
27 1.5 7
28 2.0 16
29 3.5 17
30 3.5 13
⋮ ⋮ ⋮
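For readers on a newer setup, conditional row selection might be written as in the sketch below. It assumes DataFrames.jl 1.x (where columns are accessed as df.col and subset() is available), which differs from the 0.6 release used here. The four rows are the ones shown by head(auto,4) above.

```julia
using DataFrames

# First four rows of the auto data, restricted to three columns
cars = DataFrame(mpg      = [22, 17, 22, 20],
                 headroom = [2.5, 3.0, 3.0, 4.5],
                 trunk    = [11, 11, 12, 16])

# Boolean-mask indexing: rows with mpg below 20, two selected columns
low_mpg = cars[cars.mpg .< 20, [:headroom, :trunk]]

# The same selection expressed with subset() and select()
low_mpg2 = select(subset(cars, :mpg => ByRow(<(20))), :headroom, :trunk)
```

Both forms return the single row for the AMC Pacer (headroom 3.0, trunk 11); which one to use is largely a matter of taste.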

2. Summary statistics

We can look at summary statistics in a few different ways. First, notice that the showcols() function reported the number of NA or missing values for each variable.

The describe() function also displays missing value frequencies and percentages, in addition to reporting the min/max, mean, median, number of unique observations, and quartiles for each variable in the data frame:



In [4]:

    
describe(auto)









    



make
Length  74
Type    UTF8String
NAs     0
NA%     0.0%
Unique  74

price
Min      3291.0
1st Qu.  4220.25
Median   5006.5
Mean     6165.256756756757
3rd Qu.  6332.25
Max      15906.0
NAs      0
NA%      0.0%

mpg
Min      12.0
1st Qu.  18.0
Median   20.0
Mean     21.2972972972973
3rd Qu.  24.75
Max      41.0
NAs      0
NA%      0.0%

rep78
Min      1.0
1st Qu.  3.0
Median   3.0
Mean     3.4057971014492754
3rd Qu.  4.0
Max      5.0
NAs      5
NA%      6.76%

headroom
Min      1.5
1st Qu.  2.5
Median   3.0
Mean     2.9932432432432434
3rd Qu.  3.5
Max      5.0
NAs      0
NA%      0.0%

trunk
Min      5.0
1st Qu.  10.25
Median   14.0
Mean     13.756756756756756
3rd Qu.  16.75
Max      23.0
NAs      0
NA%      0.0%

weight
Min      1760.0
1st Qu.  2250.0
Median   3190.0
Mean     3019.4594594594596
3rd Qu.  3600.0
Max      4840.0
NAs      0
NA%      0.0%

length
Min      142.0
1st Qu.  170.0
Median   192.5
Mean     187.93243243243242
3rd Qu.  203.75
Max      233.0
NAs      0
NA%      0.0%

turn
Min      31.0
1st Qu.  36.0
Median   40.0
Mean     39.648648648648646
3rd Qu.  43.0
Max      51.0
NAs      0
NA%      0.0%

displacement
Min      79.0
1st Qu.  119.0
Median   196.0
Mean     197.2972972972973
3rd Qu.  245.25
Max      425.0
NAs      0
NA%      0.0%

gear_ratio
Min      2.19
1st Qu.  2.73
Median   2.955
Mean     3.0148648648648644
3rd Qu.  3.3525
Max      3.89
NAs      0
NA%      0.0%

foreign
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.2972972972972973
3rd Qu.  1.0
Max      1.0
NAs      0
NA%      0.0%

We can also display the mean for each variable using the colwise() function. However, this function will return an error if we include string variables.



In [122]:

    
colwise(mean,auto)









    



LoadError: MethodError: `+` has no method matching +(::UTF8String, ::UTF8String)
Closest candidates are:
  +(::Any, ::Any, !Matched::Any, !Matched::Any...)
while loading In[122], in expression starting on line 1

 in mapreduce_seq_impl at reduce.jl:228
 in mapreduce_pairwise_impl at reduce.jl:108
 in _mapreduce at reduce.jl:153
 in mapreduce at C:\Users\tmr17\.julia\v0.4\DataArrays\src\reduce.jl:110
 in mean at C:\Users\tmr17\.julia\v0.4\DataArrays\src\reduce.jl:137
 in colwise at C:\Users\tmr17\.julia\v0.4\DataFrames\src\groupeddataframe\grouping.jl:248



In [171]:

    
round(colwise(mean,auto[:,2:end]),3)









    Out[171]:





11-element Array{Any,1}:
 [6165.257]
 [21.297]  
 [NA]      
 [2.993]   
 [13.757]  
 [3019.459]
 [187.932] 
 [39.649]  
 [197.297] 
 [3.015]   
 [0.297]

Notice that this returns NA for variables with at least one missing observation.
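One way to avoid the NA result is to skip missing entries when averaging. The sketch below assumes a current DataFrames.jl release (1.x), in which NA has been replaced by missing, rather than the 0.6 release used in this tutorial. The three rows are taken from the head(auto,4) output above; the names numeric_cols and col_means are only illustrative.

```julia
using DataFrames, Statistics

# Three rows of the auto data, including the missing rep78 value
df = DataFrame(make  = ["AMC Concord", "AMC Pacer", "AMC Spirit"],
               price = [4099, 4749, 3799],
               rep78 = [3, 3, missing])

# Keep only columns whose elements are numbers (possibly mixed with missing)
numeric_cols = [n for n in names(df) if eltype(df[!, n]) <: Union{Missing, Number}]

# Mean of each numeric column, ignoring missing entries
col_means = Dict(n => mean(skipmissing(df[!, n])) for n in numeric_cols)
```

Filtering on the element type first sidesteps the error raised by string columns, and skipmissing() plays the role of dropping NA values before taking the mean.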

Tabulations and cross-tabulations

We can also compute frequencies of categorical variables, using a couple of different functions:

countmap() returns cell counts as a dictionary:



In [124]:

    
countmap(auto[:foreign])









    Out[124]:





Dict{Union{DataArrays.NAtype,Int64},Int64} with 2 entries:
  0 => 52
  1 => 22

We can also use the by() function, coupled with the nrow function:



In [125]:

    
by(auto,:foreign,nrow)









    Out[125]:




foreign x1
1 0 52
2 1 22

For cross-tabulations, we require the FreqTables package, which was loaded earlier.



In [126]:

    
freqtable(auto, :rep78, :foreign, subset=!isna(auto[:rep78]))









    Out[126]:





5x2 NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{Dict{Int64,Int64},Dict{Int64,Int64}}}
rep78 \ foreign 0  1 
1               2  0 
2               8  0 
3               27 3 
4               9  9 
5               2  9

Notice that, at the moment, Julia's DataFrames package lags substantially behind other languages in computing cross-tabulations and contingency tables.
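On newer releases, one-way and two-way counts can also be produced in "long" form with groupby() and combine(). This is a sketch assuming DataFrames.jl 1.x, not the 0.6 API used above, and the five-row data frame is hypothetical.

```julia
using DataFrames

# Hypothetical data: repair record (with one missing value) and origin
cars = DataFrame(rep78   = [3, 3, missing, 4, 5],
                 foreign = [0, 0, 0, 1, 1])

# One-way tabulation: number of rows per value of :foreign
one_way = combine(groupby(cars, :foreign; sort = true), nrow => :count)

# Two-way tabulation in long form, after dropping rows with missing rep78
complete = dropmissing(cars, :rep78)
two_way = combine(groupby(complete, [:rep78, :foreign]; sort = true), nrow => :count)
```

The two-way result has one row per observed (rep78, foreign) combination rather than a rectangular table, so empty cells simply do not appear.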

3. Dropping, keeping, renaming, and generating

Dropping observations and variables

Suppose we want to delete observations in our dataset according to some rule. This amounts to keeping the complement of the rule. For example, if we want to drop all observations of the data frame where a variable is missing, we index the rows we want to keep with !isna() and select all columns:



In [5]:

    
auto1 = auto[!isna(auto[:,:rep78]), :];



In [6]:

    
showcols(auto1)









    



69x12 DataFrames.DataFrame
| Col # | Name         | Eltype     | Missing |
|-------|--------------|------------|---------|
| 1     | make         | UTF8String | 0       |
| 2     | price        | Int64      | 0       |
| 3     | mpg          | Int64      | 0       |
| 4     | rep78        | Int64      | 0       |
| 5     | headroom     | Float64    | 0       |
| 6     | trunk        | Int64      | 0       |
| 7     | weight       | Int64      | 0       |
| 8     | length       | Int64      | 0       |
| 9     | turn         | Int64      | 0       |
| 10    | displacement | Int64      | 0       |
| 11    | gear_ratio   | Float64    | 0       |
| 12    | foreign      | Int64      | 0       |

Notice that we now have 5 fewer observations in the new data frame, and that there are no missing values.

We can drop variables in two different ways:

First, by using the complement of a keep statement:



In [7]:

    
auto1 = auto1[setdiff(names(auto1), [:price,:mpg])];



In [8]:

    
showcols(auto1)









    



69x10 DataFrames.DataFrame
| Col # | Name         | Eltype     | Missing |
|-------|--------------|------------|---------|
| 1     | make         | UTF8String | 0       |
| 2     | rep78        | Int64      | 0       |
| 3     | headroom     | Float64    | 0       |
| 4     | trunk        | Int64      | 0       |
| 5     | weight       | Int64      | 0       |
| 6     | length       | Int64      | 0       |
| 7     | turn         | Int64      | 0       |
| 8     | displacement | Int64      | 0       |
| 9     | gear_ratio   | Float64    | 0       |
| 10    | foreign      | Int64      | 0       |

Second, we can drop in-place using the delete!() function, which overwrites the data frame.



In [139]:

    
delete!(auto1,[:weight,:length]);



In [140]:

    
showcols(auto1)









    



69x8 DataFrames.DataFrame
| Col # | Name         | Eltype     | Missing |
|-------|--------------|------------|---------|
| 1     | make         | UTF8String | 0       |
| 2     | rep78        | Int64      | 0       |
| 3     | headroom     | Float64    | 0       |
| 4     | trunk        | Int64      | 0       |
| 5     | turn         | Int64      | 0       |
| 6     | displacement | Int64      | 0       |
| 7     | gear_ratio   | Float64    | 0       |
| 8     | foreign      | Int64      | 0       |

Keeping variables

We can keep variables simply by indexing the variable names or column numbers of interest:



In [141]:

    
auto2 = auto[:,[:make,:mpg,:displacement,:gear_ratio]];



In [142]:

    
showcols(auto2)









    



74x4 DataFrames.DataFrame
| Col # | Name         | Eltype     | Missing |
|-------|--------------|------------|---------|
| 1     | make         | UTF8String | 0       |
| 2     | mpg          | Int64      | 0       |
| 3     | displacement | Int64      | 0       |
| 4     | gear_ratio   | Float64    | 0       |

Renaming variables

Suppose we want to rename the variables in the "kept" data frame from directly above. This is easily accomplished with the rename!() function:



In [143]:

    
rename!(auto2,[:make,:displacement],[:make_name,:CCs]);



In [144]:

    
showcols(auto2)









    



74x4 DataFrames.DataFrame
| Col # | Name       | Eltype     | Missing |
|-------|------------|------------|---------|
| 1     | make_name  | UTF8String | 0       |
| 2     | mpg        | Int64      | 0       |
| 3     | CCs        | Int64      | 0       |
| 4     | gear_ratio | Float64    | 0       |
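For comparison, dropping, keeping, and renaming columns might look as follows on a newer installation. The sketch assumes DataFrames.jl 1.x, where select(), Not(), and pair-based rename() are available; these are not the calls used in this tutorial's version. The two rows come from the head(auto,4) output above.

```julia
using DataFrames

cars = DataFrame(make = ["AMC Concord", "AMC Pacer"],
                 price = [4099, 4749],
                 mpg = [22, 17],
                 displacement = [121, 258])

# Drop columns: keep everything except :price and :mpg
dropped = select(cars, Not([:price, :mpg]))

# Keep columns: list the ones of interest
kept = select(cars, :make, :mpg)

# Rename columns with old => new pairs (returns a new data frame)
renamed = rename(cars, :make => :make_name, :displacement => :CCs)
```

The versions ending in "!" (select! and rename!) modify the data frame in place instead of returning a copy.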

Generating new variables

Cloning variables

Cloning a variable is easily done as follows:



In [146]:

    
auto2[:mpg_same] = auto2[:mpg];



In [147]:

    
showcols(auto2)









    



74x5 DataFrames.DataFrame
| Col # | Name       | Eltype     | Missing |
|-------|------------|------------|---------|
| 1     | make_name  | UTF8String | 0       |
| 2     | mpg        | Int64      | 0       |
| 3     | CCs        | Int64      | 0       |
| 4     | gear_ratio | Float64    | 0       |
| 5     | mpg_same   | Int64      | 0       |

Generating new variables using functions of existing variables

To generate a new variable using a function of one or more existing variables, the syntax is a bit more involved. For instance, suppose we want to create a new variable called mpgSquared, which is equal to mpg squared:



In [148]:

    
auto2[:mpgSquared] = map(temp -> temp.^2, auto[:mpg]);

We use map() to accomplish the task, which takes as arguments a function (using arrow notation) and an input (auto[:mpg]). Note that the argument of the function (temp, here) can be any name. I used temp to emphasize that it is a variable of local scope, and thus purely temporary.

Finally, note that there is a "." before the caret symbol, indicating that this is a vectorized operation. Failure to include the "." will result in an error.

We can verify that the function worked as expected:



In [149]:

    
head(auto2[:,[:mpg,:mpgSquared]])









    Out[149]:




mpg mpgSquared
1 22 484
2 17 289
3 22 484
4 20 400
5 15 225
6 18 324

We can use this framework to generate new variables using any mathematical function. For example, we can create a dummy variable that is equal to 1 if :mpg is less than 20 and :gear_ratio is less than 3, and 0 otherwise:



In [157]:

    
auto2[:dummy_var] = map((tempx,tempy) -> (tempx.<20) & (tempy.<3), auto2[:mpg], auto2[:gear_ratio]);



In [158]:

    
showcols(auto2)









    



74x7 DataFrames.DataFrame
| Col # | Name       | Eltype     | Missing |
|-------|------------|------------|---------|
| 1     | make_name  | UTF8String | 0       |
| 2     | mpg        | Int64      | 0       |
| 3     | CCs        | Int64      | 0       |
| 4     | gear_ratio | Float64    | 0       |
| 5     | mpg_same   | Int64      | 0       |
| 6     | mpgSquared | Any        | 0       |
| 7     | dummy_var  | Bool       | 0       |

Note that the type of the new dummy variable is Bool instead of Int64.
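If an integer dummy is preferred over a Bool, the result can be converted explicitly. The sketch below uses broadcasting ("dot" syntax) and assumes a recent Julia with DataFrames.jl 1.x, which is not the setup used in this tutorial. The four rows are the mpg and gear_ratio values shown by head(auto,4) earlier.

```julia
using DataFrames

cars = DataFrame(mpg = [22, 17, 22, 20],
                 gear_ratio = [3.58, 2.53, 3.08, 2.93])

# Element-wise square of an existing column
cars.mpgSquared = cars.mpg .^ 2

# Element-wise condition, converted from Bool to Int (1 if true, 0 otherwise)
cars.dummy_var = Int.((cars.mpg .< 20) .& (cars.gear_ratio .< 3))
```

Only the second row (mpg 17, gear ratio 2.53) satisfies both conditions, so dummy_var equals 1 there and 0 elsewhere.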

4. Ordering and Sorting

Ordering columns

Suppose we want to change the ordering of the variables of our data frame. This is most easily done as follows:



In [159]:

    
auto2 = auto2[:,[2;3;4;1;5:end]];



In [160]:

    
showcols(auto2)









    



74x7 DataFrames.DataFrame
| Col # | Name       | Eltype     | Missing |
|-------|------------|------------|---------|
| 1     | mpg        | Int64      | 0       |
| 2     | CCs        | Int64      | 0       |
| 3     | gear_ratio | Float64    | 0       |
| 4     | make_name  | UTF8String | 0       |
| 5     | mpg_same   | Int64      | 0       |
| 6     | mpgSquared | Any        | 0       |
| 7     | dummy_var  | Bool       | 0       |

Sorting observations by various variables

We can also sort the observations in our data frame by any number of columns, each in ascending or descending order. The behavior of the sort!() function closely mirrors Stata's gsort command.

Below, we will sort ascending by :mpg and descending by :make_name:



In [169]:

    
sort!(auto2,cols=[:mpg,:make_name],rev=[false,true]);



In [170]:

    
head(auto2,4)









    Out[170]:




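| Row | mpg | CCs | gear_ratio | make_name   | mpg_same | mpgSquared | dummy_var |
|-----|-----|-----|------------|-------------|----------|------------|-----------|
| 1   | 12  | 163 | 2.98       | Volvo 260   | 12       | 1681       | false     |
| 2   | 12  | 97  | 3.78       | VW Scirocco | 12       | 1225       | false     |
| 3   | 14  | 89  | 3.78       | VW Rabbit   | 14       | 1225       | false     |
| 4   | 14  | 90  | 3.78       | VW Diesel   | 14       | 1156       | false     |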
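A mixed ascending/descending sort can be checked on a small example. The sketch below assumes DataFrames.jl 1.x, where the sort columns are passed positionally and order() sets the direction per column, instead of the cols and rev keywords used above. The data frame and its one-letter make names are hypothetical.

```julia
using DataFrames

cars = DataFrame(mpg = [14, 12, 14, 12],
                 make_name = ["A", "B", "C", "D"])

# Ascending by :mpg, then descending by :make_name within each mpg value
sorted = sort(cars, [:mpg, order(:make_name, rev = true)])
```

Here sorted lists the makes in the order D, B, C, A: the two 12-mpg cars come first in reverse alphabetical order, followed by the two 14-mpg cars.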

5. Reshaping and Merging

Reshaping

Julia's DataFrames allows for reshaping of longitudinal datasets, much as other statistical software programs do.

Let's start by hand-creating a "wide" panel dataset with 3 individuals and 3 time periods:



In [173]:

    
reshape1 = DataFrame(id = 1:3, sex = [0;1;0], 
                     inc1980 = [5000;2000;3000],
                     inc1981 = [5500;2200;2000],
                     inc1982 = [6000;4400;1000])









    Out[173]:




id sex inc1980 inc1981 inc1982
1 1 0 5000 5500 6000
2 2 1 2000 2200 4400
3 3 0 3000 2000 1000

We can reshape this data frame to "long" format by using the stack() command provided by the DataFrames package:



In [174]:

    
longform1A = stack(reshape1, [:inc1980, :inc1981, :inc1982], [:id, :sex])









    Out[174]:




variable value id sex
1 inc1980 5000 1 0
2 inc1980 2000 2 1
3 inc1980 3000 3 0
4 inc1981 5500 1 0
5 inc1981 2200 2 1
6 inc1981 2000 3 0
7 inc1982 6000 1 0
8 inc1982 4400 2 1
9 inc1982 1000 3 0

Now we have three replications of each :id and :sex (the time-invariant columns), as well as two new columns, labeled :variable and :value.

We can sort the new dataframe so that it is in a more readable format:



In [175]:

    
sort!(longform1A, cols = [:id, :variable])









    Out[175]:




variable value id sex
1 inc1980 5000 1 0
2 inc1981 5500 1 0
3 inc1982 6000 1 0
4 inc1980 2000 2 1
5 inc1981 2200 2 1
6 inc1982 4400 2 1
7 inc1980 3000 3 0
8 inc1981 2000 3 0
9 inc1982 1000 3 0

And we can also reshape back to "wide" format using the unstack() function:



In [176]:

    
wideform1A = unstack(longform1A, :variable, :value)









    Out[176]:




id sex inc1980 inc1981 inc1982
1 1 0 5000 5500 6000
2 2 1 2000 2200 4400
3 3 0 3000 2000 1000

It's worth noting that this method does not work very well when there are multiple time-varying variables per :id. We'll discuss this in detail a bit later.

Merging

DataFrames also has functions that allow the user to merge two data frames together. There are many different types of possible merges, all accessible via the join() function.

The merge types differ in whether unmatched observations from either data frame are kept, not in whether the identifier is duplicated in the data frames being merged (i.e. each type of merge can be used for both one-to-one and many-to-one merges).

The basic syntax is c = join(a, b, on = [:id1, :id2], kind = symbol), where a and b are data frames each with the identifiers :id1 and :id2, and kind is a symbol that can take any of the following 7 values:

:inner: The output contains rows for values of the key that exist in both the first (left) and second (right) arguments to join (this is the keep(match) option in Stata)
:left: The output contains rows for values of the key that exist in the first (left) argument to join, whether or not that value exists in the second (right) argument (this is the keep(match master) option in Stata)
:right: The output contains rows for values of the key that exist in the second (right) argument to join, whether or not that value exists in the first (left) argument (this is the keep(match using) option in Stata)
:outer: The output contains rows for values of the key that exist in the first (left) or second (right) argument to join (this is the Stata default)
:semi: Like an inner join, but output is restricted to columns from the first (left) argument to join (there is no exact Stata equivalent; the closest is keep(match) without adding any variables from the using data)
:anti: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument to join. As with semi joins, output is restricted to columns from the first (left) argument (this resembles the keep(master) option in Stata, which keeps only unmatched master observations)
:cross: The output is the cartesian product of rows from the first (left) and second (right) arguments to join (this is equivalent to Stata's cross command). Note also that :cross is the only merge type that does not require an identifier in each data frame
Let's show how to do each of these merges using a simple set of data frames.



In [178]:

    
name = DataFrame(ID = [1, 2, 3, 4, 5, 6], Name = ["John", "Jane", "Mark", "Ann", "Vlad", "Maria"])









    Out[178]:




ID Name
1 1 John
2 2 Jane
3 3 Mark
4 4 Ann
5 5 Vlad
6 6 Maria



In [179]:

    
jobs = DataFrame(ID = [1, 2, 3, 4, 5, 6], Job = ["Lawyer", "Doctor", "Mechanic", "Doctor", "Judge", "Pilot"])









    Out[179]:




ID Job
1 1 Lawyer
2 2 Doctor
3 3 Mechanic
4 4 Doctor
5 5 Judge
6 6 Pilot



In [186]:

    
siblings = DataFrame(ID = [1, 1, 2, 3, 5, 5, 5, 6],
           Sibling = ["Eric", "Ryan", "Jennifer", "Heather", "Carl", "Dmitri", "Andrei", "Pedro"])









    Out[186]:




ID Sibling
1 1 Eric
2 1 Ryan
3 2 Jennifer
4 3 Heather
5 5 Carl
6 5 Dmitri
7 5 Andrei
8 6 Pedro

Let's do a simple :inner merge on the first name and jobs data frames:



In [183]:

    
mergedNameJobs = join(name,jobs, on = :ID, kind = :inner)









    Out[183]:




ID Name Job
1 1 John Lawyer
2 2 Jane Doctor
3 3 Mark Mechanic
4 4 Ann Doctor
5 5 Vlad Judge
6 6 Maria Pilot

Now let's see what happens when we merge name with siblings, under a variety of join types:



In [184]:

    
mergedNameSibsInner = join(name,siblings, on = :ID, kind = :inner)









    Out[184]:




ID Name Sibling
1 1 John Eric
2 1 John Ryan
3 2 Jane Jennifer
4 3 Mark Heather
5 5 Vlad Carl
6 5 Vlad Dmitri
7 5 Vlad Andrei
8 6 Maria Pedro

With the :inner join, those who don't have siblings are removed from the merged data frame.



In [185]:

    
mergedNameSibsOuter = join(name,siblings, on = :ID, kind = :outer)









    Out[185]:




ID Name Sibling
1 1 John Eric
2 1 John Ryan
3 2 Jane Jennifer
4 3 Mark Heather
5 5 Vlad Carl
6 5 Vlad Dmitri
7 5 Vlad Andrei
8 6 Maria Pedro
9 4 Ann NA

When we instead do an :outer join, we see that Ann, who doesn't have any siblings, shows as NA under :Sibling.

Other less-common merge types:



In [187]:

    
mergedNameSibsLeft = join(name,siblings, on = :ID, kind = :left)









    Out[187]:




ID Name Sibling
1 1 John Eric
2 1 John Ryan
3 2 Jane Jennifer
4 3 Mark Heather
5 5 Vlad Carl
6 5 Vlad Dmitri
7 5 Vlad Andrei
8 6 Maria Pedro
9 4 Ann NA



In [188]:

    
mergedNameSibsRight = join(name,siblings, on = :ID, kind = :right)









    Out[188]:




Name ID Sibling
1 John 1 Eric
2 John 1 Ryan
3 Jane 2 Jennifer
4 Mark 3 Heather
5 Vlad 5 Carl
6 Vlad 5 Dmitri
7 Vlad 5 Andrei
8 Maria 6 Pedro



In [191]:

    
mergedNameSibsSemi = join(name,siblings, on = :ID, kind = :semi)









    Out[191]:




ID Name
1 1 John
2 2 Jane
3 3 Mark
4 5 Vlad
5 6 Maria



In [192]:

    
mergedNameSibsAnti = join(name,siblings, on = :ID, kind = :anti)









    Out[192]:




ID Name
1 4 Ann



In [193]:

    
mergedNameSibsCross = join(name,siblings, kind = :cross)









    Out[193]:




ID Name ID_1 Sibling
1 1 John 1 Eric
2 1 John 1 Ryan
3 1 John 2 Jennifer
4 1 John 3 Heather
5 1 John 5 Carl
6 1 John 5 Dmitri
7 1 John 5 Andrei
8 1 John 6 Pedro
9 2 Jane 1 Eric
10 2 Jane 1 Ryan
11 2 Jane 2 Jennifer
12 2 Jane 3 Heather
13 2 Jane 5 Carl
14 2 Jane 5 Dmitri
15 2 Jane 5 Andrei
16 2 Jane 6 Pedro
17 3 Mark 1 Eric
18 3 Mark 1 Ryan
19 3 Mark 2 Jennifer
20 3 Mark 3 Heather
21 3 Mark 5 Carl
22 3 Mark 5 Dmitri
23 3 Mark 5 Andrei
24 3 Mark 6 Pedro
25 4 Ann 1 Eric
26 4 Ann 1 Ryan
27 4 Ann 2 Jennifer
28 4 Ann 3 Heather
29 4 Ann 5 Carl
30 4 Ann 5 Dmitri
⋮ ⋮ ⋮ ⋮ ⋮
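In later DataFrames.jl releases, the single join() function with a kind argument was replaced by one function per join type. The sketch below assumes DataFrames.jl 1.x and repeats three of the merges above on the same name and siblings data frames; the short result names are only illustrative.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3, 4, 5, 6],
                 Name = ["John", "Jane", "Mark", "Ann", "Vlad", "Maria"])
siblings = DataFrame(ID = [1, 1, 2, 3, 5, 5, 5, 6],
                     Sibling = ["Eric", "Ryan", "Jennifer", "Heather",
                                "Carl", "Dmitri", "Andrei", "Pedro"])

# Keep only IDs present in both data frames
inner = innerjoin(name, siblings, on = :ID)

# Keep IDs present in either data frame; unmatched cells become missing
outer = outerjoin(name, siblings, on = :ID)

# Keep rows of name whose ID has no match in siblings
anti = antijoin(name, siblings, on = :ID)
```

As in the outputs above, the inner join has 8 rows, the outer join has 9 (Ann appears with a missing sibling), and the anti join contains only Ann.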

Reshaping with multiple time-varying variables

As mentioned earlier, the reshaping method outlined above does not work very well when there are multiple time-varying variables per :id. With the join() function in hand, this becomes possible, though it is less convenient than in other software packages.

Let's revisit our previous example, except now with two time-varying variables (inc* and ue*):



In [194]:

    
reshape2 = DataFrame(id = 1:3, sex = [0;1;0], inc1980 = [5000;2000;3000], 
                     inc1981 = [5500;2200;2000],inc1982 = [6000;4400;1000],
                    ue1980 = [0;1;0], ue1981 = [1;0;0], ue1982 = [0;0;1])









    Out[194]:




id sex inc1980 inc1981 inc1982 ue1980 ue1981 ue1982
1 1 0 5000 5500 6000 0 1 0
2 2 1 2000 2200 4400 1 0 0
3 3 0 3000 2000 1000 0 0 1

If we try to reshape this using a similar stack() call as before, we get:



In [195]:

    
longform2 = stack(reshape2, [:inc1980, :inc1981, :inc1982, :ue1980, :ue1981, :ue1982],
                 [:id, :sex])









    Out[195]:




variable value id sex
1 inc1980 5000 1 0
2 inc1980 2000 2 1
3 inc1980 3000 3 0
4 inc1981 5500 1 0
5 inc1981 2200 2 1
6 inc1981 2000 3 0
7 inc1982 6000 1 0
8 inc1982 4400 2 1
9 inc1982 1000 3 0
10 ue1980 0 1 0
11 ue1980 1 2 1
12 ue1980 0 3 0
13 ue1981 1 1 0
14 ue1981 0 2 1
15 ue1981 0 3 0
16 ue1982 0 1 0
17 ue1982 0 2 1
18 ue1982 1 3 0

The inc* and ue* values are stacked, so that we have double the number of observations we would like to have.

The remedy for this is to do the reshaping separately for each type of variable, and then merge together.
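The stack-separately-then-merge remedy can be sketched as follows. The code assumes DataFrames.jl 1.x (stack() with the variable_name and value_name keywords, and innerjoin() in place of join()), not the 0.6 release used in this tutorial, and stack_prefix is a hypothetical helper written for this example.

```julia
using DataFrames

reshape2 = DataFrame(id = 1:3, sex = [0, 1, 0],
                     inc1980 = [5000, 2000, 3000],
                     inc1981 = [5500, 2200, 2000],
                     inc1982 = [6000, 4400, 1000],
                     ue1980 = [0, 1, 0], ue1981 = [1, 0, 0], ue1982 = [0, 0, 1])

# Hypothetical helper: stack all columns starting with `prefix`, then turn
# the old column names (e.g. "inc1980") into an integer :year column.
function stack_prefix(df, prefix)
    cols = [n for n in names(df) if startswith(n, prefix)]
    long = stack(df, cols, [:id, :sex];
                 variable_name = :variable, value_name = Symbol(prefix))
    long.year = [parse(Int, replace(String(v), prefix => "")) for v in long.variable]
    return select(long, :id, :sex, :year, Symbol(prefix))
end

# Reshape each type of variable separately ...
long_inc = stack_prefix(reshape2, "inc")
long_ue  = stack_prefix(reshape2, "ue")

# ... then merge on the identifiers and the year, and sort for readability
longform2 = innerjoin(long_inc, long_ue, on = [:id, :sex, :year])
sort!(longform2, [:id, :year])
```

The result has one row per :id and :year (9 rows instead of 18), with :inc and :ue side by side.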

Converting data frames to regular Julia arrays

Conversion from data frames to regular Julia arrays may be required for use of libraries outside of the DataFrames and GLM world.

To convert, simply type

arrayName = convert(Array,dataFrameName)

But be aware that any NA elements of the data frame will cause an error to be thrown (because Julia's regular arrays do not know the NA type).
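One way to avoid the problem is to drop incomplete rows before converting. The sketch below assumes DataFrames.jl 1.x, where missing values are represented by missing and the conversion is written Matrix(df) rather than convert(Array, df). The three rows are taken from the head(auto,4) output earlier.

```julia
using DataFrames

df = DataFrame(mpg = [22, 17, 22], rep78 = [3, 3, missing])

# Drop every row that contains a missing value
complete = dropmissing(df)

# Convert the remaining rows to a plain Julia matrix
X = Matrix(complete)
```

Here X is a 2×2 integer matrix holding the two complete rows.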
