Accessible data

The function readFamaFrenchRaw tries to make data from Kenneth R. French's data library accessible in Julia.

What the function expects

The function expects to find a .zip file that contains a .txt file of data. The .txt file starts with a description of the data and then lists several data sets. The individual data sets are separated by empty lines, and each data set has one header line followed by one or two lines of column names. The first column of each data set contains dates written without any separator (for example, 192701 for January 1927).
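The layout described above can be sketched as follows. This is a minimal illustration written for Julia 1.x, not the code used inside readFamaFrenchRaw; the helper name split_blocks and the shortened sample text are assumptions made for this example. Note that the description part of a real file may itself contain empty lines, so a full parser needs more than this split.

```julia
# Hypothetical helper (not part of EconDatasets): split raw text into blocks
# of consecutive non-empty lines.
function split_blocks(txt::AbstractString)
    blocks = Vector{Vector{String}}()
    current = String[]
    for line in split(txt, r"\r?\n")
        if isempty(strip(line))
            # an empty line closes the block collected so far
            isempty(current) || push!(blocks, current)
            current = String[]
        else
            push!(current, String(line))
        end
    end
    isempty(current) || push!(blocks, current)
    return blocks
end

# Shortened sample in the layout of the extract shown in this notebook
sample = """
This file was created by CMPT_ME_PRIOR_RETS using the 201405 CRSP database.

  Average Value Weighted Returns -- Monthly
              Small                 Big
          Low     2    High    Low     2    High
192701   0.01   3.79   0.39  -0.63   0.23   0.00
192702   7.13   6.24   5.75   5.59   3.78   4.49

  Average Equal Weighted Returns -- Monthly
               Small                 Big
          Low     2    High    Low     2    High
192701   1.77   3.33  -0.81   0.36   0.62   0.95
"""

blocks = split_blocks(sample)
description = blocks[1]      # leading description
datasets = blocks[2:end]     # one block per data set
println(length(datasets))
println(datasets[1][1])      # header line of the first data set
```

In each data set block, the first line is the header line, the next one or two lines hold the column names, and the remaining lines hold the data.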

The obstacle to full automation is the column names: a single variable name sometimes consists of two parts separated by whitespace. Hence it is very difficult to tell automatically whether two whitespace-separated strings are two different column names or parts of a single one.
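A small sketch of this ambiguity, written for Julia 1.x. The header line below is a made-up illustration of two-part names, not a line taken from the file used in this notebook.

```julia
# Hypothetical header line: three columns whose names each consist of two parts
header = "          Lo 30   Med 40    Hi 30"
# A matching data row: one date column plus three value columns
row = "192701     0.01     3.79     0.39"

nameTokens = split(header)   # split on any run of whitespace
dataTokens = split(row)

println(length(nameTokens))       # number of name tokens
println(length(dataTokens) - 1)   # number of value columns
```

Splitting on whitespace yields six name tokens for three value columns, and nothing in the tokens themselves says which pairs belong together.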

As an example, the following cell shows an extract of the data set 6 Portfolios Formed on Size and Momentum (2 x 3). The comment signs ## at the beginning of each line are not part of the original file; they only keep Julia from executing the lines.


In [1]:
##This file was created by CMPT_ME_PRIOR_RETS using the 201405 CRSP database.
##It contains value- weighted returns for the intersections of  2 ME portfolios
##and  3 prior return portfolios.
##
##The portfolios are constructed monthly.  ME is market cap at the end of the
##previous month.  PRIOR_RET is from -12 to - 2.
##
##Missing data are indicated by -99.99 or -999.
##
##
##  Average Value Weighted Returns -- Monthly
##              Small                 Big         
##          Low     2    High    Low     2    High 
##192701   0.01   3.79   0.39  -0.63   0.23   0.00
##192702   7.13   6.24   5.75   5.59   3.78   4.49
##192703  -3.26  -2.95  -2.30  -7.66  -0.22   2.29
##192704  -0.56  -0.96   3.36  -1.90   0.78   1.89
##192705   2.47  11.39   7.00   4.21   4.87   7.10
##                          .
##                          .
##                          .
##201401  -2.48  -3.55  -2.59  -5.00  -3.35  -1.51
##201402   3.90   4.12   5.49   3.90   4.13   6.62
##201403   0.61   1.50  -1.20   2.02   1.63  -2.72
##201404  -2.45  -3.03  -5.27   2.70   0.73  -2.23
##201405   0.64   0.89  -0.50   0.61   2.40   3.43
##
##
##  Average Equal Weighted Returns -- Monthly
##               Small                 Big         
##          Low     2    High    Low     2    High 
##192701   1.77   3.33  -0.81   0.36   0.62   0.95
##192702   6.82   6.46   6.08   7.93   4.98   5.10
##192703  -4.55  -1.02  -3.56  -4.46  -1.19   0.63
##192704   2.13  -1.05   3.51  -1.74   0.95   1.33
##192705   2.72  11.36   7.54   5.51   4.60   8.43
##192706  -2.86  -1.33  -2.61  -4.19  -1.66  -1.59
##192707   5.32   4.88   6.45   6.01   6.72   6.85
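
The file description above states that missing data are indicated by -99.99 or -999. A minimal sketch, written for Julia 1.x, of how such codes could be recoded after loading; the helper name recode is an assumption for this example, and whether readFamaFrenchRaw performs such a step is not stated here.

```julia
# Hypothetical helper: map the sentinel values named in the file description
# to `missing` and keep every other value unchanged.
recode(x::Real; codes = (-99.99, -999.0)) = x in codes ? missing : x

values = [0.01, -99.99, 0.39, -999.0]
cleaned = recode.(values)

println(cleaned)
println(count(ismissing, cleaned))   # number of recoded entries
```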

Application

The function is called with the URL of the .zip file, given as an ASCIIString. It returns a tuple consisting of three parts:

  • the actual data sets as Array{Any,1}
  • the description of each individual data set as Array{Symbol,1}
  • the column / variable names as Array{Union(UTF8String,ASCIIString),1}

In [2]:
using EconDatasets

In [3]:
dataUrl = "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/6_Portfolios_ME_Prior_12_2_TXT.zip"
(data, dataNames, varnames) = readFamaFrenchRaw(dataUrl)

(typeof(data), typeof(dataNames), typeof(varnames))


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  103k  100  103k    0     0  94912      0  0:00:01  0:00:01 --:--:-- 95004
Archive:  /tmp/juliaiI5UdL
  inflating: /tmp/6_Portfolios_ME_Prior_12_2.txt  
Out[3]:
(Array{Any,1},Array{Symbol,1},Array{Union(ASCIIString,UTF8String),1})

Data format

Each data set is one entry in an Array{Any,1}. Hence, the number of data sets can be determined with length.


In [4]:
nData = length(data)


Out[4]:
8

Their descriptions can be found in the variable dataNames.


In [5]:
dataNames


Out[5]:
8-element Array{Symbol,1}:
 symbol("  Average Value Weighted Returns -- Monthly\r\n")
 symbol("  Average Equal Weighted Returns -- Monthly\r\n")
 symbol("  Average Value Weighted Returns -- Annual\r\n") 
 symbol("  Average Equal Weighted Returns -- Annual\r\n") 
 symbol("  Number of Firms in Portfolios\r\n")            
 symbol("  Average Firm Size\r\n")                        
 symbol("  Equally-Weighted Average of Prior Returns\r\n")
 symbol("  Value-Weighted Average of Prior Returns\r\n")  
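
The descriptions still carry the leading blanks and the trailing "\r\n" of the raw header lines. A minimal sketch of cleaning them, written for Julia 1.x (where the constructor is spelled Symbol); the two entries are copied from the output above, and the names rawNames and cleanNames are assumptions for this example.

```julia
# Two entries copied from the output above
rawNames = [Symbol("  Average Value Weighted Returns -- Monthly\r\n"),
            Symbol("  Number of Firms in Portfolios\r\n")]

# strip removes leading and trailing whitespace, including "\r\n"
cleanNames = [String(strip(String(nm))) for nm in rawNames]

println(cleanNames)

# Look up the position of a data set by its cleaned description
idx = findfirst(==("Number of Firms in Portfolios"), cleanNames)
println(idx)
```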

Each individual data set is stored as a Timematr, with default names for the individual columns.


In [6]:
data[1]


Out[6]:

Timematr{Date}

Dimensions: (1064, 6)

From: 1927-01-31, To: 2015-08-31

    idx             x1      x2      x3      x4      x5      x6
 1  1927-01-31   -0.09    3.62     0.4    -0.4    0.27     0.0
 2  1927-02-28    7.26    6.11    6.05    7.27     3.8    4.47
 3  1927-03-31   -3.38   -2.88   -2.06   -3.63   -0.24    2.23
 4  1927-04-30   -0.51   -0.56    3.32   -2.72    0.75    1.82
 5  1927-05-31    2.39   11.21    6.82    5.52    4.85    7.11
 6  1927-06-30   -2.06   -0.44   -2.88   -3.72   -2.12   -1.87
 7  1927-07-31    3.14    4.29    6.93    4.53    7.84    9.38
 8  1927-08-31     0.0   -0.44     0.1    1.21     2.1    3.35
 9  1927-09-30    2.53     0.3    3.19    3.02    4.87    6.22
10  1927-10-31    -4.5    -1.8   -3.34   -2.11   -3.39    -5.5
11  1927-11-30   11.62    8.64    7.02    5.12    6.02    8.35
12  1927-12-31    0.37    5.74    4.53    1.23     1.7    3.46
13  1928-01-31    2.92    3.47    0.66   -1.93   -0.24   -0.87
14  1928-02-29    -3.5   -1.92   -4.77   -0.81   -1.04   -1.58
15  1928-03-31    6.45    7.81    7.57    6.75    5.31    5.37
16  1928-04-30     8.2   10.42    5.98   11.14    4.02    2.05
17  1928-05-31    2.76    2.37    4.01   -1.21    1.06    3.24
18  1928-06-30   -9.28   -7.31   -6.27   -5.17   -4.05   -4.09
19  1928-07-31   -1.67   -0.57    1.99   -0.28    0.78    1.56
20  1928-08-31    4.95    2.41    6.29     4.3    6.26   10.21
21  1928-09-30     4.2    5.33    7.61    1.43    1.89    5.25
22  1928-10-31     0.3    0.92    7.29    0.77    0.78    3.88
23  1928-11-30   10.06   11.67   10.55    8.78   13.49   12.15
24  1928-12-31   -2.67    0.65    -1.4   -0.47    0.07    1.67
25  1929-01-31   -0.11    2.19    2.84    3.88     5.2    5.79
26  1929-02-28   -0.12    1.22    3.19   -1.61   -0.83    1.18
27  1929-03-31    -4.2   -5.67   -4.07    0.78    0.82   -1.44
28  1929-04-30    1.74   -0.65     1.9   -0.94    1.74    2.69
29  1929-05-31  -11.22   -8.92  -11.78   -3.41   -4.53   -6.93
30  1929-06-30    4.29    6.13   10.38    3.08    7.72   14.43
 ⋮      ⋮            ⋮       ⋮       ⋮       ⋮       ⋮       ⋮
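
The raw file writes monthly dates as yyyymm, while the index above shows month-end dates. A minimal sketch of such a conversion, written for Julia 1.x with the Dates standard library; the helper name month_end is an assumption for this example, and it covers only the monthly data sets, not the annual ones.

```julia
using Dates

# Hypothetical helper: turn a "yyyymm" string into the last day of that month
function month_end(yyyymm::AbstractString)
    y = parse(Int, yyyymm[1:4])
    m = parse(Int, yyyymm[5:6])
    return lastdayofmonth(Date(y, m))   # Date(y, m) is the first day of the month
end

println(month_end("192701"))
println(month_end("192802"))   # February of a leap year
```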

Data processing

For a clean end result, one only needs to rename the individual columns. The original column names can be accessed from the variable varnames. Note that the function assumes that the column names of all data sets are the same!


In [7]:
varnames


Out[7]:
2-element Array{Union(ASCIIString,UTF8String),1}:
 "               Small                 Big         \r\n"
 "          Low     2    High    Low     2    High \r\n"

As an example, we translate these variable names manually into the following names:


In [8]:
newVarnames = [:SmallLow, :SmallMed, :SmallHigh, :BigLow, :BigMed, :BigHigh]


Out[8]:
6-element Array{Symbol,1}:
 :SmallLow 
 :SmallMed 
 :SmallHigh
 :BigLow   
 :BigMed   
 :BigHigh  
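
For this particular file, the manual translation could be approximated by a positional heuristic: each name in the lower header line is attached to the group name in the upper line that sits closest to it horizontally. The sketch below is written for Julia 1.x; it is not what readFamaFrenchRaw does, the helper names are assumptions for this example, and the heuristic may well fail for files with a different layout.

```julia
# Hypothetical helper: every whitespace-separated token with its horizontal center
function tokens_with_centers(line::AbstractString)
    # m.offset is the 1-based start index of the token within `line`
    return [(String(m.match), m.offset + (length(m.match) - 1) / 2)
            for m in eachmatch(r"\S+", line)]
end

# Hypothetical helper: prefix each lower-level name with the nearest group name
function combine_headers(groupLine::AbstractString, nameLine::AbstractString)
    groups = tokens_with_centers(groupLine)
    cols = tokens_with_centers(nameLine)
    combined = Symbol[]
    for (nm, pos) in cols
        # index of the group with the smallest horizontal distance
        _, idx = findmin([abs(pos - gpos) for (_, gpos) in groups])
        push!(combined, Symbol(groups[idx][1], nm))
    end
    return combined
end

# The two header lines as returned in varnames
groupLine = "               Small                 Big         \r\n"
nameLine  = "          Low     2    High    Low     2    High \r\n"

autoNames = combine_headers(groupLine, nameLine)
println(autoNames)
```

The heuristic keeps the raw label 2 for the middle portfolios (Small2, Big2), whereas the manual names use SmallMed and BigMed, so a manual pass remains useful.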

In [9]:
# Replace the default column names of every data set with the new names
for ii=1:length(data)
    rename!(data[ii].vals, names(data[ii].vals), newVarnames)
end

In [10]:
data[1]


Out[10]:

Timematr{Date}

Dimensions: (1064, 6)

From: 1927-01-31, To: 2015-08-31

    idx          SmallLow   SmallMed  SmallHigh     BigLow     BigMed    BigHigh
 1  1927-01-31      -0.09       3.62        0.4       -0.4       0.27        0.0
 2  1927-02-28       7.26       6.11       6.05       7.27        3.8       4.47
 3  1927-03-31      -3.38      -2.88      -2.06      -3.63      -0.24       2.23
 4  1927-04-30      -0.51      -0.56       3.32      -2.72       0.75       1.82
 5  1927-05-31       2.39      11.21       6.82       5.52       4.85       7.11
 6  1927-06-30      -2.06      -0.44      -2.88      -3.72      -2.12      -1.87
 7  1927-07-31       3.14       4.29       6.93       4.53       7.84       9.38
 8  1927-08-31        0.0      -0.44        0.1       1.21        2.1       3.35
 9  1927-09-30       2.53        0.3       3.19       3.02       4.87       6.22
10  1927-10-31       -4.5       -1.8      -3.34      -2.11      -3.39       -5.5
11  1927-11-30      11.62       8.64       7.02       5.12       6.02       8.35
12  1927-12-31       0.37       5.74       4.53       1.23        1.7       3.46
13  1928-01-31       2.92       3.47       0.66      -1.93      -0.24      -0.87
14  1928-02-29       -3.5      -1.92      -4.77      -0.81      -1.04      -1.58
15  1928-03-31       6.45       7.81       7.57       6.75       5.31       5.37
16  1928-04-30        8.2      10.42       5.98      11.14       4.02       2.05
17  1928-05-31       2.76       2.37       4.01      -1.21       1.06       3.24
18  1928-06-30      -9.28      -7.31      -6.27      -5.17      -4.05      -4.09
19  1928-07-31      -1.67      -0.57       1.99      -0.28       0.78       1.56
20  1928-08-31       4.95       2.41       6.29        4.3       6.26      10.21
21  1928-09-30        4.2       5.33       7.61       1.43       1.89       5.25
22  1928-10-31        0.3       0.92       7.29       0.77       0.78       3.88
23  1928-11-30      10.06      11.67      10.55       8.78      13.49      12.15
24  1928-12-31      -2.67       0.65       -1.4      -0.47       0.07       1.67
25  1929-01-31      -0.11       2.19       2.84       3.88        5.2       5.79
26  1929-02-28      -0.12       1.22       3.19      -1.61      -0.83       1.18
27  1929-03-31       -4.2      -5.67      -4.07       0.78       0.82      -1.44
28  1929-04-30       1.74      -0.65        1.9      -0.94       1.74       2.69
29  1929-05-31     -11.22      -8.92     -11.78      -3.41      -4.53      -6.93
30  1929-06-30       4.29       6.13      10.38       3.08       7.72      14.43
 ⋮      ⋮               ⋮          ⋮          ⋮          ⋮          ⋮          ⋮