Douglas Bates
February 24, 2015
Julia, a recently developed language for technical computing, has many favorable features. However, it is still in the early development phase and many techiques for statistical analysis are not yet implemented in Julia.
In contrast R, the most widely used language in statistics, has a mature infrastructure including access to several thousand user-contributed packages that implement a wide variety of statistical techniques. Obviously this wealth of software is not going to be reimplemented overnight. It is desirable to be able to access R and R packages from within Julia.
Furthermore, one of the admirable properties of R is that it has a sophisticated system for storing data sets and their metadata compactly. It is expected that R packages will provide data sets to illustrate techniques and to provide reference results against which to compare other implementations. Even though a data archival format for Julia is available in the HDF5 package, it does not by itself provide access to all the data sets that already exist in R packages.
The RCall package for Julia provides the ability to run an embedded R within Julia. One basic usage is to create a copy in Julia of a dataset from an R package.
In addition to showing some basic usage of RCall this notebook describes some of the implementation details.
The Rcall package is installed by running
Pkg.add("RCall")
In a typical installation the shell commands R RHOME and Rscript are called to determine the location of R's shared library and the values of some environment variables. A successful installation requires R and Rscript to be on the user's search path.
The RCall package can be configured to use another version of R by setting several environment variables; R_HOME, R_DOC_DIR, R_INCLUDE_DIR, R_SHARE_DIR and LD_LIBRARY_PATH to appropriate values then running
Pkg.build("RCall")
The RCall package is attached to the current Julia session with
using RCall
It is a good idea to also attach the DataArrays and DataFrames packages.
In [1]:
using DataArrays,DataFrames,RCall
The Rcall package adds methods for DataFrame and DataArray that evaluate a Julia Symbol or ASCIIString in R and create the indicated Julia type from the result, if possible.
For example
In [2]:
attenu = DataFrame(:attenu)
Out[2]:
creates a copy in Julia of the attenu dataset. It is possible to use a symbol here because the attenu data frame is in the R datasets package, which is, by default, loaded at startup.
To access data frames from an R package that is not loaded at startup, you can use the :: notation in a character string
In [3]:
DataFrame("ggplot2::diamonds")
Out[3]:
or first attach the package's namespace, then use a symbol.
In [4]:
rcopy("library(robustbase)")
Out[4]:
In [5]:
coleman = DataFrame(:coleman)
Out[5]:
The difference between using the :: operator in R and evaluating an expression like "library(robustbase)" is that the latter attaches the namespace of the package to the R search path. The :: operator accesses the package's contents but does not put it on the search path.
You can also check the search path explicitly
In [6]:
rcopy("search()")
Out[6]:
The RCall package exports functions rcopy, reval, rparse, rprint and sexp, the datatype SEXPREC, and the globalEnv object.
In [7]:
whos(RCall)
rcopy, rprint and globalEnv provide the high-level interface.
As seen in the last section, rcopy evaluates a symbol or a character string in the embedded R. The reason it is called rcopy is because it evaluates the expression and copies the contents of the return value to storage allocated by Julia.
This is the preferred way to evaluate an R expression because the value of the R expression is copied into storage allocated by Julia and displayed by Julia. Some users prefer to write such calls as a Julia pipe so that the R expression occurs first.
In [8]:
"search()" |> rcopy
Out[8]:
Sometimes the R expression to be evaluated contains a quoted string. It is easiest to use single quotes in the R expression because, in R, single quotes, ', are equivalent to double quotes, ". In a Julia literal string double quotes must be escaped but single quotes do not.
In [9]:
"exists('airmiles')" |> rcopy
Out[9]:
The result of rcopy cannot be used as an argument to other functions in the C API for R. To do this we must preserve the intermediate representation returned by reval.
Also, because there is not a one-to-one correspondence between R objects and Julia objects, information is often lost when to Julia.
For example, the value of exists('airmiles') is a logical vector of length 1 in R but logical values in R happen to be stored as 32-bit integers and rcopy returns this type.
Julia uses types to represent objects with data and structural metadata. In R the structural metadata, such as array dimensions, names or dimension names, are stored as attributes of the object. rcopy preserves array dimensions but drops other attributes.
For example, airmiles is an R time series.
In [10]:
"names(attributes(airmiles))" |> rcopy
Out[10]:
Access to the attributes is available from the result of reval, the low-level evaluation of R expressions, which is described below.
The rprint generic applies R's printing methods which, naturally, do use the information in the attributes.
In [11]:
rprint(:airmiles)
Top-level assignments in R are in what is called the global environment. Assignments in Julia to this R environment are written as assignments to names in the globalEnv object exported by the RCall package.
In [12]:
globalEnv[:x] = [1:10];
rprint(:x)
In [13]:
"mean(x)" |> rcopy
Out[13]:
In [14]:
"ls()" |> rcopy
Out[14]:
Typing globalEnv for each assignment can get tedious. Some users prefer to create a shorter alias, such as g.
In [15]:
const g = globalEnv;
g[:y] = "Hello world";
rprint(:y)
The conversion of a Julia object to an R object is performed by methods for the sexp generic. If an sexp method doesn't exist for the Julia object you will see an error of the form
In [16]:
g[:x] = 1:10
Please open an issue or create a pull request on the github repository if you have a reasonable way of representing the Julia type in R.
Base R and various R packages provide a wide variety of model-fitting functions. Frequently these functions return an R list object with a class attribute. At present you can't count on rcopy being able to handle lists that may contain language elements.
In [17]:
"m1 <- lmrob(Y ~ ., coleman)" |> rcopy
Note that the failure is in copying the results of the model fit to Julia. m1 has been evaluated and assigned in R so the usual extractor functions in R can be applied to it. If they return simple objects, rcopy can be applied.
In [18]:
rprint(:m1)
In [19]:
"coef(m1)" |> rprint
In [20]:
m1coef = rcopy("coef(m1)")
Out[20]:
In [21]:
"coef(summary(m1))" |> rprint
In [22]:
m1coefmat = rcopy("coef(summary(m1))")
Out[22]:
We see that rcopy drops the names and dimnames. Julia Array objects do not have provision for names. The NamedArrays package for Julia does provide for named dimensions. Methods for NamedArray may help.
Missing values are indicated by a sentinel value in R vectors. The sentinel for Float64 vectors is one of the NaN values and the sentinel for integer or logical vectors and for factors is typemin(Int32)
In [23]:
RCall.R_NaReal
Out[23]:
In [24]:
reinterpret(Uint64,ans)
Out[24]:
In [25]:
RCall.R_NaInt
Out[25]:
In [26]:
typemin(Int32)
Out[26]:
DataArray methods for R numeric vectors create DataArray objects in which the missing data values are converted to the representation used in the DataArrays package. In the process R logical vectors are converted to Bool objects and R factor objects are converted to PackedDataArray objects.
In [27]:
"c(1,2,NA,3)" |> rcopy
Out[27]:
In [28]:
"c(1,2,NA,4)" |> reval |> DataArray # check this result
Out[28]:
In [29]:
"gl(2,4,labels=c('A','B'))" |> reval |> DataArray
Out[29]:
When the implementation of NullAble types in Julia is more mature it may make sense to convert to those types for missing data representation.
For many people the ability to use R graphics packages, such as ggplot2, would be a prime motivation for using the RCall package. Unfortunately, for me, trying to initialize the graphics system in an embedded R always fails.
In [30]:
"pdf(file='/tmp/plt1.pdf'); plot(1:10,(1:10)^2); dev.off()" |> rcopy
To understand the internal structures used by R it helps to know a bit of the history of the language and implementation. In the mid 1990's Ross Ihaka and Robert Gentleman, both then at the University of Auckland, embarked on developing a language implementation that was "not unlike S", a language developed by John Chambers and others at AT&T Bell Labs. The Bell Labs S implementation and the commercial S-PLUS implementation were proprietary, closed-source code. Ross and Robert chose to base their open-source implementation internally on Scheme, as described in Abelson and Sussmans book "Structure and Interpretation of Computer Programs".
Those familiar with Lisp will recognize many terms and concepts
in the C API for R. R objects are represented as a C struct called a SEXPREC or symbolic expression. This struct is actually a union of different representations for different types of R objects. The SEXP type in the API is a pointer to a SEXPREC. Most functions in the C API to R return such a pointer.
The Julia abstract type, also called SEXPREC, has several subtypes corresponding to the internal R structures.
In [31]:
subtypes(SEXPREC)
Out[31]:
The reval function is the low-level version of rcopy. It evaluates a Julia Symbol or Julia String containing an R expression and returns one of the Julia SEXPREC types.
In [32]:
m1 = reval(:m1)
Out[32]:
In [33]:
reval("coef(summary(m1))")
Out[33]:
Methods for the Julia sexp generic create SEXPREC representations of Julia types, copying the data from Julia into storage allocated by R
. There is also an sexp(p::Ptr{Void}) method that takes an SEXP (pointer to an R SEXPREC), usually the value of ccall of a function in the R API, and converts it to the appropriate type of Julia SEXPREC.
As seen above, a Julia SEXPREC type like VecSxp or RealSxp consists of an integer tag, several pointers and, in some cases, other information like the length of a vector.
In [34]:
names(RCall.RealSxp)
Out[34]:
In [35]:
names(RCall.VecSxp)
Out[35]:
The Julia type is created by unsafe_load applied to the SEXP pointer returned by the R API function followed by overwriting two of the pointers used by R's garbage collector. The p pointer is to the original SEXP and the pv pointer is to the contents of a vector. The numerical value of the pv pointer is p plus a fixed offset, which happens to be 40 bytes on a 64-bit system.
In [36]:
RCall.voffset
Out[36]:
In [37]:
m1.pv - m1.p
Out[37]:
In [38]:
int(m1.pv - m1.p)
Out[38]:
The reason for storing both p and pv is because pv has both an address and an eltype. Conversion of pv with, say, pointer_to_array, returns a vector of the appropriate Julia bitstype.
In [39]:
m1coef = reval("coef(m1)");
pointer_to_array(m1coef.pv,m1coef.length)
Out[39]:
or, equivalently,
In [40]:
vec(m1coef)
Out[40]:
Although this result looks like the result of rcopy(m1coef) there is an important difference between the two. The contents of the vec result are in storage allocated by R and may be trashed by the R garbage collector. The contents of the rcopy result have been copied to storage allocated by Julia and controlled by the Julia garbage collector.
This is why reval is a low-level interface. It gives you enough rope to hang yourself.
Another use of sexp is unravelling some of the fields in an SEXPREC. As mentioned previously, R objects can contain "attributes". The attrib field in a Julia SEXPREC type is the pointer to the attributes.
In [41]:
m1.attrib
Out[41]:
In [42]:
sexp(m1.attrib)
Out[42]:
Somewhat confusingly, what is called a list in R is internally a VecSxp or vector of SEXPs (pointers to SEXPRECs). The type of SEXPREC known as a ListSxp is a cons cell. Types that can occur in a cons cell are
In [43]:
RCall.PairList
Out[43]:
Except for NilSxp, which is the NULL sentinel value, these PairList types have car, cdr and tag fields.
In [44]:
names(RCall.ListSxp)
Out[44]:
(The genc_prev pointer field in this and other other non-vector concrete SEXPREC types is an artifact of the constructor's using unsafe_copy. It is not used in Julia.)
The vector-like concrete Julia SEXPREC types are
In [45]:
RCall.RVector
Out[45]:
which are further divided into
In [46]:
RCall.VectorAtomic
Out[46]:
for which vec returns an array of Numbers, and
In [47]:
RCall.VectorList
Out[47]:
for which vec returns a vector of pointers.
In [48]:
vec(m1)
Out[48]:
It is usually more meaningful to map sexp over this vector of pointers
In [49]:
map(sexp,vec(m1))
Out[49]:
The SymSxp type is an R symbol. The CharSxp type is character string represented as a vector vector of bytes. Although the byte vector is null-terminated the length field is the string length (i.e. the number of characters before the null terminator).
In [50]:
sexp(sexp(m1.attrib).tag)
Out[50]:
In [51]:
names(RCall.SymSxp)
Out[51]:
In [52]:
snm = sexp(sexp(sexp(m1.attrib).tag).pname)
Out[52]:
In [53]:
vec(snm)
Out[53]:
In [54]:
bytestring(vec(snm))
Out[54]:
Don't confuse a CharSxp with a "character" vector in R. The "character" type in the R REPL is a StrSxp, which internally is a vector of pointers to CharSxp objects. Because there are no scalars at the level of the R REPL, CharSxp objects cannot occur there.
It is worthwhile examining the "all data is stored in a vector" approach of R. A single numeric value is represented as a vector of length 1.
In [55]:
"1" |> reval
Out[55]:
Also, we can see that R has the peculiar convention that numeric literals, even those without a decimal point, are converted to Float64 values. An integer literal value is written as
In [56]:
"1L" |> reval
Out[56]:
Returning to the unravelling of attributes, the car of the first cons cell is the vector of names of the R "list".
In [57]:
sexp(sexp(m1.attrib).car)
Out[57]:
In [58]:
sexp(sexp(m1.attrib).car) |> rcopy
Out[58]:
The cdr of the first cons cell is a pointer to the next cons cell or to the NilSxp (there is only one NilSxp).
In [59]:
sexp(sexp(m1.attrib).cdr)
Out[59]:
In [60]:
rcopy(sexp(sexp(m1.attrib).cdr).tag)
Out[60]:
In [61]:
rcopy(sexp(sexp(m1.attrib).cdr).car)
Out[61]:
In [62]:
sexp(sexp(sexp(m1.attrib).cdr).cdr)
Out[62]:
If you are looking for a particular attribute you can access it by name using getAttrib
In [63]:
getAttrib(m1,:class)
Out[63]:
In [64]:
getAttrib(m1,:class) |> rcopy
Out[64]:
If an attribute of the given name does not exist getAttrib returns NilSxp.
In [65]:
getAttrib(m1,:foo)
Out[65]:
Following the Jon Snow convention, rcopy returns the Julia nothing object for an NilSxp
In [66]:
getAttrib(m1,:foo) |> rcopy
Currently the Base.names method for a SEXPREC returns the R names attribute. This is convenient if you know of the names function in R but may be changed in later versions of RCall. (It is in some ways a misuse of the Base.names generic which should return the names of fields from a Julia object.)
In [67]:
names(m1)
Out[67]:
length and size methods return the usual results
In [68]:
length(m1)
Out[68]:
In [69]:
size(m1)
Out[69]:
In [70]:
length(reval("coef(summary(m1))"))
Out[70]:
In [71]:
size(reval("coef(summary(m1))"))
Out[71]: