R-Python integration


In [2]:
import pandas as pd
import numpy as np
import rpy2.robjects as robjects  # low-level bridge for calling R from Python

Running a simple command in R


In [3]:
pi = robjects.r('pi')
pi[0]


Out[3]:
3.141592653589793
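
Any R expression can be evaluated the same way: robjects.r(...) returns an R vector, which can be indexed from Python. A minimal sketch (the expression here is just illustrative):

r_mean = robjects.r('mean(c(1, 2, 3, 4))')
r_mean[0]  # 2.5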

Running code using IPython R magic


In [4]:
%load_ext rmagic
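
Note: the rmagic extension was later folded into rpy2 itself; on newer setups the equivalent is:

%load_ext rpy2.ipython  # modern replacement for the deprecated rmagic extension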

Create data in R, compute in R, return result to Python

Run a linear regression in R, print a summary, and pass the result variable error back to Python via the -o flag:


In [5]:
%%R -o error
set.seed(10)
# simulate a response and three noisy predictors
y<-c(1:1000)
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-(c(1:1000)*runif(1000,min=0,max=2))^2
x3<-log(c(1:1000)*runif(1000,min=0,max=2))

all_data<-data.frame(y,x1,x2,x3)
# 75/25 train/test split
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))
training<- all_data[positions,]
testing<- all_data[-positions,]

lm_fit<-lm(y~x1+x2+x3,data=training)
print(summary(lm_fit))

# RMSE on the held-out set
predictions<-predict(lm_fit,newdata=testing)
error<-sqrt((sum((testing$y-predictions)^2))/nrow(testing))


Call:
lm(formula = y ~ x1 + x2 + x3, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-379.34 -125.71  -29.88   87.58  732.59 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.234e+01  2.495e+01  -2.098   0.0363 *  
x1           2.414e-01  1.589e-02  15.188   <2e-16 ***
x2           1.553e-04  9.767e-06  15.900   <2e-16 ***
x3           6.404e+01  4.827e+00  13.267   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 166.4 on 746 degrees of freedom
Multiple R-squared: 0.6613,	Adjusted R-squared: 0.6599 
F-statistic: 485.5 on 3 and 746 DF,  p-value: < 2.2e-16 


In [6]:
print(error)


[ 169.85333821]
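
Because -o returns R vectors as array-likes, error is a length-1 array rather than a scalar; indexing recovers a plain float. A small sketch (rmse is just an illustrative name):

rmse = error[0]  # 169.85333821...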

Create data in R, compute in Python

First we create the data in R (the same simulated dataset as above):


In [7]:
%%R -o training,testing
set.seed(10)
y<-c(1:1000)
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-(c(1:1000)*runif(1000,min=0,max=2))^2
x3<-log(c(1:1000)*runif(1000,min=0,max=2))

all_data<-data.frame(y,x1,x2,x3)
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))
training<- all_data[positions,]
testing<- all_data[-positions,]

The variables training and testing are now available as NumPy arrays in the Python namespace, thanks to the -o flag in the cell above. We'll create pandas DataFrames from them:


In [8]:
# the R data.frame comes back column-wise, so pair each column with its name
tr = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], training)))
te = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], testing)))

tr.head()


Out[8]:
           x1              x2        x3    y
0  724.861370    19728.318211  6.430894  614
1  103.074180      928.821687  5.132348  108
2  606.561051  1050676.686068  6.564257  518
3  862.674044    91504.275820  4.670171  879
4  393.014599     1134.679888  5.721699  379
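
With newer rpy2 (the rpy2.ipython extension and its pandas converter), R data.frames can cross the boundary as pandas DataFrames directly, which would make the dict/zip step above unnecessary. A sketch, assuming a recent rpy2 with pandas support:

from rpy2.robjects import pandas2ri
pandas2ri.activate()  # after this, -o hands back pandas DataFrames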

Create a linear regression model and print a summary:


In [9]:
from statsmodels.formula.api import ols

lm = ols('y ~ x1 + x2 + x3', tr).fit()
lm.summary()


Out[9]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.661
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     485.5
Date:                Sun, 05 May 2013   Prob (F-statistic):          7.53e-175
Time:                        12:06:08   Log-Likelihood:                -4898.0
No. Observations:                 750   AIC:                             9804.
Df Residuals:                     746   BIC:                             9823.
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -52.3400     24.950     -2.098      0.036   -101.321    -3.359
x1             0.2414      0.016     15.188      0.000      0.210     0.273
x2             0.0002   9.77e-06     15.900      0.000      0.000     0.000
x3            64.0431      4.827     13.267      0.000     54.567    73.520
==============================================================================
Omnibus:                       85.222   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.468
Skew:                           0.898   Prob(JB):                     3.78e-25
Kurtosis:                       3.609   Cond. No.                     3.41e+06
==============================================================================

Predict and compute the RMSE; note that it matches the error computed entirely in R above:


In [10]:
pred = lm.predict(te)

# same RMSE formula as in the R cell above
error = np.sqrt(np.sum((te.y - pred)**2) / len(te))
error


Out[10]:
169.85333821453432

Create data in Python, compute in R

First we create the data (NumPy arrays) in Python:


In [11]:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])

We pass them into R using the -i flag, run a linear regression in R, print a summary, draw the diagnostic plots, and send the coefficients back to Python with the -o flag:


In [12]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
# 2x2 grid of the standard lm diagnostic plots
par(mfrow=c(2,2))
plot(XYlm)


Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:  0.81,	Adjusted R-squared: 0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739 

The -o flag also passed the model coefficients back from R as the variable XYcoef:


In [13]:
XYcoef


Out[13]:
array([ 3.2,  0.9])
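
As a quick sanity check, the returned coefficients reproduce the fitted line in Python (a minimal sketch; Y_hat is just an illustrative name):

intercept, slope = XYcoef
Y_hat = intercept + slope * X
Y_hat  # array([ 3.2,  4.1,  5. ,  5.9,  6.8])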
