Douglas Bates
February 24, 2015
Julia, a recently developed language for technical computing, has many favorable features. However, it is still in the early development phase and many techiques for statistical analysis are not yet implemented in Julia
.
In contrast R, the most widely used language in statistics, has a mature infrastructure including access to several thousand user-contributed packages that implement a wide variety of statistical techniques. Obviously this wealth of software is not going to be reimplemented overnight. It is desirable to be able to access R and R packages from within Julia.
Furthermore, one of the admirable properties of R is that it has a sophisticated system for storing data sets and their metadata compactly. It is expected that R packages will provide data sets to illustrate techniques and to provide reference results against which to compare other implementations. Even though a data archival format for Julia is available in the HDF5 package, it does not by itself provide access to all the data sets that already exist in R packages.
The RCall
package for Julia
provides the ability to run an embedded R within Julia. One basic usage is to create a copy in Julia
of a dataset from an R package.
In addition to showing some basic usage of RCall
this notebook describes some of the implementation details.
The Rcall
package is installed by running
Pkg.add("RCall")
In a typical installation the shell commands R RHOME
and Rscript
are called to determine the location of R's shared library and the values of some environment variables. A successful installation requires R
and Rscript
to be on the user's search path.
The RCall
package can be configured to use another version of R
by setting several environment variables; R_HOME
, R_DOC_DIR
, R_INCLUDE_DIR
, R_SHARE_DIR
and LD_LIBRARY_PATH
to appropriate values then running
Pkg.build("RCall")
The RCall
package is attached to the current Julia session with
using RCall
It is a good idea to also attach the DataArrays
and DataFrames
packages.
In [1]:
using DataArrays,DataFrames,RCall
The Rcall
package adds methods for DataFrame
and DataArray
that evaluate a Julia Symbol
or ASCIIString
in R and create the indicated Julia type from the result, if possible.
For example
In [2]:
attenu = DataFrame(:attenu)
Out[2]:
creates a copy in Julia
of the attenu
dataset. It is possible to use a symbol here because the attenu
data frame is in the R datasets
package, which is, by default, loaded at startup.
To access data frames from an R package that is not loaded at startup, you can use the ::
notation in a character string
In [3]:
DataFrame("ggplot2::diamonds")
Out[3]:
or first attach the package's namespace, then use a symbol.
In [4]:
rcopy("library(robustbase)")
Out[4]:
In [5]:
coleman = DataFrame(:coleman)
Out[5]:
The difference between using the ::
operator in R
and evaluating an expression like "library(robustbase)"
is that the latter attaches the namespace of the package to the R search path. The ::
operator accesses the package's contents but does not put it on the search path.
You can also check the search path explicitly
In [6]:
rcopy("search()")
Out[6]:
The RCall
package exports functions rcopy
, reval
, rparse
, rprint
and sexp
, the datatype SEXPREC
, and the globalEnv
object.
In [7]:
whos(RCall)
rcopy
, rprint
and globalEnv
provide the high-level interface.
As seen in the last section, rcopy
evaluates a symbol or a character string in the embedded R. The reason it is called rcopy
is because it evaluates the expression and copies the contents of the return value to storage allocated by Julia.
This is the preferred way to evaluate an R expression because the value of the R expression is copied into storage allocated by Julia and displayed by Julia. Some users prefer to write such calls as a Julia pipe so that the R expression occurs first.
In [8]:
"search()" |> rcopy
Out[8]:
Sometimes the R expression to be evaluated contains a quoted string. It is easiest to use single quotes in the R expression because, in R, single quotes, '
, are equivalent to double quotes, "
. In a Julia literal string double quotes must be escaped but single quotes do not.
In [9]:
"exists('airmiles')" |> rcopy
Out[9]:
The result of rcopy
cannot be used as an argument to other functions in the C API for R. To do this we must preserve the intermediate representation returned by reval
.
Also, because there is not a one-to-one correspondence between R objects and Julia objects, information is often lost when to Julia.
For example, the value of exists('airmiles')
is a logical vector of length 1 in R but logical values in R happen to be stored as 32-bit integers and rcopy
returns this type.
Julia uses types to represent objects with data and structural metadata. In R the structural metadata, such as array dimensions, names or dimension names, are stored as attributes
of the object. rcopy
preserves array dimensions but drops other attributes.
For example, airmiles
is an R time series.
In [10]:
"names(attributes(airmiles))" |> rcopy
Out[10]:
Access to the attributes is available from the result of reval
, the low-level evaluation of R expressions, which is described below.
The rprint
generic applies R's printing methods which, naturally, do use the information in the attributes.
In [11]:
rprint(:airmiles)
Top-level assignments in R are in what is called the global environment. Assignments in Julia to this R environment are written as assignments to names in the globalEnv
object exported by the RCall
package.
In [12]:
globalEnv[:x] = [1:10];
rprint(:x)
In [13]:
"mean(x)" |> rcopy
Out[13]:
In [14]:
"ls()" |> rcopy
Out[14]:
Typing globalEnv
for each assignment can get tedious. Some users prefer to create a shorter alias, such as g
.
In [15]:
const g = globalEnv;
g[:y] = "Hello world";
rprint(:y)
The conversion of a Julia object to an R object is performed by methods for the sexp
generic. If an sexp
method doesn't exist for the Julia object you will see an error of the form
In [16]:
g[:x] = 1:10
Please open an issue or create a pull request on the github repository if you have a reasonable way of representing the Julia type in R.
Base R and various R packages provide a wide variety of model-fitting functions. Frequently these functions return an R list
object with a class
attribute. At present you can't count on rcopy
being able to handle lists that may contain language elements.
In [17]:
"m1 <- lmrob(Y ~ ., coleman)" |> rcopy
Note that the failure is in copying the results of the model fit to Julia. m1
has been evaluated and assigned in R so the usual extractor functions in R can be applied to it. If they return simple objects, rcopy
can be applied.
In [18]:
rprint(:m1)
In [19]:
"coef(m1)" |> rprint
In [20]:
m1coef = rcopy("coef(m1)")
Out[20]:
In [21]:
"coef(summary(m1))" |> rprint
In [22]:
m1coefmat = rcopy("coef(summary(m1))")
Out[22]:
We see that rcopy
drops the names
and dimnames
. Julia Array
objects do not have provision for names. The NamedArrays
package for Julia does provide for named dimensions. Methods for NamedArray
may help.
Missing values are indicated by a sentinel value in R vectors. The sentinel for Float64
vectors is one of the NaN
values and the sentinel for integer or logical vectors and for factors is typemin(Int32)
In [23]:
RCall.R_NaReal
Out[23]:
In [24]:
reinterpret(Uint64,ans)
Out[24]:
In [25]:
RCall.R_NaInt
Out[25]:
In [26]:
typemin(Int32)
Out[26]:
DataArray
methods for R numeric vectors create DataArray
objects in which the missing data values are converted to the representation used in the DataArrays
package. In the process R logical
vectors are converted to Bool
objects and R factor
objects are converted to PackedDataArray
objects.
In [27]:
"c(1,2,NA,3)" |> rcopy
Out[27]:
In [28]:
"c(1,2,NA,4)" |> reval |> DataArray # check this result
Out[28]:
In [29]:
"gl(2,4,labels=c('A','B'))" |> reval |> DataArray
Out[29]:
When the implementation of NullAble
types in Julia is more mature it may make sense to convert to those types for missing data representation.
For many people the ability to use R graphics packages, such as ggplot2
, would be a prime motivation for using the RCall
package. Unfortunately, for me, trying to initialize the graphics system in an embedded R always fails.
In [30]:
"pdf(file='/tmp/plt1.pdf'); plot(1:10,(1:10)^2); dev.off()" |> rcopy
To understand the internal structures used by R it helps to know a bit of the history of the language and implementation. In the mid 1990's Ross Ihaka and Robert Gentleman, both then at the University of Auckland, embarked on developing a language implementation that was "not unlike S", a language developed by John Chambers and others at AT&T Bell Labs. The Bell Labs S
implementation and the commercial S-PLUS
implementation were proprietary, closed-source code. Ross and Robert chose to base their open-source implementation internally on Scheme, as described in Abelson and Sussmans book "Structure and Interpretation of Computer Programs".
Those familiar with Lisp will recognize many terms and concepts
in the C API for R. R objects are represented as a C struct called a SEXPREC
or symbolic expression. This struct is actually a union of different representations for different types of R objects. The SEXP
type in the API is a pointer to a SEXPREC
. Most functions in the C API to R return such a pointer.
The Julia abstract type, also called SEXPREC
, has several subtypes corresponding to the internal R structures.
In [31]:
subtypes(SEXPREC)
Out[31]:
The reval
function is the low-level version of rcopy
. It evaluates a Julia Symbol
or Julia String
containing an R expression and returns one of the Julia SEXPREC
types.
In [32]:
m1 = reval(:m1)
Out[32]:
In [33]:
reval("coef(summary(m1))")
Out[33]:
Methods for the Julia sexp
generic create SEXPREC
representations of Julia types, copying the data from Julia into storage allocated by R
. There is also an sexp(p::Ptr{Void})
method that takes an SEXP
(pointer to an R SEXPREC
), usually the value of ccall
of a function in the R API, and converts it to the appropriate type of Julia SEXPREC
.
As seen above, a Julia SEXPREC
type like VecSxp
or RealSxp
consists of an integer tag, several pointers and, in some cases, other information like the length of a vector.
In [34]:
names(RCall.RealSxp)
Out[34]:
In [35]:
names(RCall.VecSxp)
Out[35]:
The Julia type is created by unsafe_load
applied to the SEXP pointer returned by the R API function followed by overwriting two of the pointers used by R's garbage collector. The p
pointer is to the original SEXP
and the pv
pointer is to the contents of a vector. The numerical value of the pv
pointer is p
plus a fixed offset, which happens to be 40 bytes on a 64-bit system.
In [36]:
RCall.voffset
Out[36]:
In [37]:
m1.pv - m1.p
Out[37]:
In [38]:
int(m1.pv - m1.p)
Out[38]:
The reason for storing both p
and pv
is because pv
has both an address and an eltype
. Conversion of pv
with, say, pointer_to_array
, returns a vector of the appropriate Julia bitstype.
In [39]:
m1coef = reval("coef(m1)");
pointer_to_array(m1coef.pv,m1coef.length)
Out[39]:
or, equivalently,
In [40]:
vec(m1coef)
Out[40]:
Although this result looks like the result of rcopy(m1coef)
there is an important difference between the two. The contents of the vec
result are in storage allocated by R and may be trashed by the R garbage collector. The contents of the rcopy
result have been copied to storage allocated by Julia and controlled by the Julia garbage collector.
This is why reval
is a low-level interface. It gives you enough rope to hang yourself.
Another use of sexp
is unravelling some of the fields in an SEXPREC
. As mentioned previously, R objects can contain "attributes". The attrib
field in a Julia SEXPREC
type is the pointer to the attributes.
In [41]:
m1.attrib
Out[41]:
In [42]:
sexp(m1.attrib)
Out[42]:
Somewhat confusingly, what is called a list
in R is internally a VecSxp
or vector of SEXP
s (pointers to SEXPREC
s). The type of SEXPREC
known as a ListSxp
is a cons cell. Types that can occur in a cons cell are
In [43]:
RCall.PairList
Out[43]:
Except for NilSxp
, which is the NULL sentinel value, these PairList
types have car
, cdr
and tag
fields.
In [44]:
names(RCall.ListSxp)
Out[44]:
(The genc_prev
pointer field in this and other other non-vector concrete SEXPREC
types is an artifact of the constructor's using unsafe_copy
. It is not used in Julia.)
The vector-like concrete Julia SEXPREC
types are
In [45]:
RCall.RVector
Out[45]:
which are further divided into
In [46]:
RCall.VectorAtomic
Out[46]:
for which vec
returns an array of Numbers
, and
In [47]:
RCall.VectorList
Out[47]:
for which vec
returns a vector of pointers.
In [48]:
vec(m1)
Out[48]:
It is usually more meaningful to map sexp
over this vector of pointers
In [49]:
map(sexp,vec(m1))
Out[49]:
The SymSxp
type is an R symbol. The CharSxp
type is character string represented as a vector vector of bytes. Although the byte vector is null-terminated the length
field is the string length (i.e. the number of characters before the null terminator).
In [50]:
sexp(sexp(m1.attrib).tag)
Out[50]:
In [51]:
names(RCall.SymSxp)
Out[51]:
In [52]:
snm = sexp(sexp(sexp(m1.attrib).tag).pname)
Out[52]:
In [53]:
vec(snm)
Out[53]:
In [54]:
bytestring(vec(snm))
Out[54]:
Don't confuse a CharSxp
with a "character" vector in R. The "character" type in the R REPL is a StrSxp
, which internally is a vector of pointers to CharSxp
objects. Because there are no scalars at the level of the R REPL, CharSxp
objects cannot occur there.
It is worthwhile examining the "all data is stored in a vector" approach of R. A single numeric value is represented as a vector of length 1.
In [55]:
"1" |> reval
Out[55]:
Also, we can see that R has the peculiar convention that numeric literals, even those without a decimal point, are converted to Float64
values. An integer literal value is written as
In [56]:
"1L" |> reval
Out[56]:
Returning to the unravelling of attributes, the car
of the first cons cell is the vector of names of the R "list".
In [57]:
sexp(sexp(m1.attrib).car)
Out[57]:
In [58]:
sexp(sexp(m1.attrib).car) |> rcopy
Out[58]:
The cdr
of the first cons cell is a pointer to the next cons cell or to the NilSxp
(there is only one NilSxp
).
In [59]:
sexp(sexp(m1.attrib).cdr)
Out[59]:
In [60]:
rcopy(sexp(sexp(m1.attrib).cdr).tag)
Out[60]:
In [61]:
rcopy(sexp(sexp(m1.attrib).cdr).car)
Out[61]:
In [62]:
sexp(sexp(sexp(m1.attrib).cdr).cdr)
Out[62]:
If you are looking for a particular attribute you can access it by name using getAttrib
In [63]:
getAttrib(m1,:class)
Out[63]:
In [64]:
getAttrib(m1,:class) |> rcopy
Out[64]:
If an attribute of the given name does not exist getAttrib
returns NilSxp
.
In [65]:
getAttrib(m1,:foo)
Out[65]:
Following the Jon Snow convention, rcopy
returns the Julia nothing
object for an NilSxp
In [66]:
getAttrib(m1,:foo) |> rcopy
Currently the Base.names
method for a SEXPREC
returns the R names
attribute. This is convenient if you know of the names
function in R but may be changed in later versions of RCall
. (It is in some ways a misuse of the Base.names
generic which should return the names of fields from a Julia object.)
In [67]:
names(m1)
Out[67]:
length
and size
methods return the usual results
In [68]:
length(m1)
Out[68]:
In [69]:
size(m1)
Out[69]:
In [70]:
length(reval("coef(summary(m1))"))
Out[70]:
In [71]:
size(reval("coef(summary(m1))"))
Out[71]: