R-Python integration


In [2]:
import pandas as pd
import numpy as np
import rpy2.robjects as robjects  # low-level bridge for calling R from Python

Running a simple command in R


In [3]:
pi = robjects.r('pi')
pi[0]


Out[3]:
3.141592653589793
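
Any R expression can be evaluated the same way: robjects.r(...) returns an R vector, which can be indexed from Python. A minimal sketch (the expression here is just illustrative):

r_mean = robjects.r('mean(c(1, 2, 3, 4))')
r_mean[0]  # 2.5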

Running code using IPython R magic


In [4]:
%load_ext rmagic
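
Note: the rmagic extension was later folded into rpy2 itself; on newer setups the equivalent is:

%load_ext rpy2.ipython  # modern replacement for the deprecated rmagic extension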

Create data in R, compute in R, return result to Python

Run a linear regression in R, print a summary, and pass the result variable error back to Python via the -o flag:


In [5]:
%%R -o error
set.seed(10)
# simulate a response and three noisy predictors
y<-c(1:1000)
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-(c(1:1000)*runif(1000,min=0,max=2))^2
x3<-log(c(1:1000)*runif(1000,min=0,max=2))

all_data<-data.frame(y,x1,x2,x3)
# 75/25 train/test split
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))
training<- all_data[positions,]
testing<- all_data[-positions,]

lm_fit<-lm(y~x1+x2+x3,data=training)
print(summary(lm_fit))

# RMSE on the held-out set
predictions<-predict(lm_fit,newdata=testing)
error<-sqrt((sum((testing$y-predictions)^2))/nrow(testing))


Call:
lm(formula = y ~ x1 + x2 + x3, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-379.34 -125.71  -29.88   87.58  732.59 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.234e+01  2.495e+01  -2.098   0.0363 *  
x1           2.414e-01  1.589e-02  15.188   <2e-16 ***
x2           1.553e-04  9.767e-06  15.900   <2e-16 ***
x3           6.404e+01  4.827e+00  13.267   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 166.4 on 746 degrees of freedom
Multiple R-squared: 0.6613,	Adjusted R-squared: 0.6599 
F-statistic: 485.5 on 3 and 746 DF,  p-value: < 2.2e-16 


In [6]:
print(error)


[ 169.85333821]
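
Because -o returns R vectors as array-likes, error is a length-1 array rather than a scalar; indexing recovers a plain float. A small sketch (rmse is just an illustrative name):

rmse = error[0]  # 169.85333821...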

Create data in R, compute in Python

First we create the data in R (the same simulated dataset as above):


In [7]:
%%R -o training,testing
set.seed(10)
y<-c(1:1000)
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-(c(1:1000)*runif(1000,min=0,max=2))^2
x3<-log(c(1:1000)*runif(1000,min=0,max=2))

all_data<-data.frame(y,x1,x2,x3)
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))
training<- all_data[positions,]
testing<- all_data[-positions,]

The variables training and testing are now available as NumPy arrays in the Python namespace, thanks to the -o flag in the cell above. We'll create pandas DataFrames from them:


In [8]:
# the R data.frame comes back column-wise, so pair each column with its name
tr = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], training)))
te = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], testing)))

tr.head()


Out[8]:
           x1              x2        x3    y
0  724.861370    19728.318211  6.430894  614
1  103.074180      928.821687  5.132348  108
2  606.561051  1050676.686068  6.564257  518
3  862.674044    91504.275820  4.670171  879
4  393.014599     1134.679888  5.721699  379
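
With newer rpy2 (the rpy2.ipython extension and its pandas converter), R data.frames can cross the boundary as pandas DataFrames directly, which would make the dict/zip step above unnecessary. A sketch, assuming a recent rpy2 with pandas support:

from rpy2.robjects import pandas2ri
pandas2ri.activate()  # after this, -o hands back pandas DataFrames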

Create a linear regression model and print a summary:


In [9]:
from statsmodels.formula.api import ols

lm = ols('y ~ x1 + x2 + x3', tr).fit()
lm.summary()


Out[9]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.661
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     485.5
Date:                Sun, 05 May 2013   Prob (F-statistic):          7.53e-175
Time:                        12:06:08   Log-Likelihood:                -4898.0
No. Observations:                 750   AIC:                             9804.
Df Residuals:                     746   BIC:                             9823.
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -52.3400     24.950     -2.098      0.036   -101.321    -3.359
x1             0.2414      0.016     15.188      0.000      0.210     0.273
x2             0.0002   9.77e-06     15.900      0.000      0.000     0.000
x3            64.0431      4.827     13.267      0.000     54.567    73.520
==============================================================================
Omnibus:                       85.222   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.468
Skew:                           0.898   Prob(JB):                     3.78e-25
Kurtosis:                       3.609   Cond. No.                     3.41e+06
==============================================================================

Predict and compute the RMSE; note that it matches the error computed entirely in R above:


In [10]:
pred = lm.predict(te)

# same RMSE formula as in the R cell above
error = np.sqrt(np.sum((te.y - pred)**2) / len(te))
error


Out[10]:
169.85333821453432

Create data in Python, compute in R

First we create the data (NumPy arrays) in Python:


In [11]:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])

We pass them into R using the -i flag, run a linear regression in R, print a summary, draw the diagnostic plots, and send the coefficients back to Python with the -o flag:


In [12]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
# 2x2 grid of the standard lm diagnostic plots
par(mfrow=c(2,2))
plot(XYlm)


Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:  0.81,	Adjusted R-squared: 0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739 

The -o flag also passed the model coefficients back from R as the variable XYcoef:


In [13]:
XYcoef


Out[13]:
array([ 3.2,  0.9])
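
As a quick sanity check, the returned coefficients reproduce the fitted line in Python (a minimal sketch; Y_hat is just an illustrative name):

intercept, slope = XYcoef
Y_hat = intercept + slope * X
Y_hat  # array([ 3.2,  4.1,  5. ,  5.9,  6.8])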
