R-Python integration


In [5]:
from IPython.core.display import HTML

In [7]:
import collections
import cPickle
import gzip
import os
import sys
import time

import numpy
from numpy import dtype
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

In [37]:
from theano import tensor as T
from theano import function, shared
import numpy

# Two shared 2x3 integer matrices used in the slicing examples below.
x = shared(numpy.array([[0, 1, 2], [0, 1, 2]]))
z = shared(numpy.array([[0, 1, 1], [0, 1, 1]]))
size_of_x = 2  # number of leading columns to keep when slicing

In [39]:
x.get_value()


Out[39]:
array([[0, 1, 2],
       [0, 1, 2]])

In [41]:
y = theano.tensor.mean(x)
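
y is a symbolic expression, not a value. A minimal sketch (assuming the shared variable x defined above) of compiling it into a callable with theano.function:


In [ ]:
# Compile the symbolic mean into a callable and evaluate it.
f = function([], y)
f()  # mean of all entries of x -> array(1.0)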

In [71]:
train_y = numpy.array([0, 1, 2, 4, 5, 6, 7, 8, 9, 2, 8])

In [72]:
train_y_T = train_y[numpy.newaxis].T  # reshape to a column vector for OneHotEncoder

In [76]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(n_values=10, dtype=theano.config.floatX, sparse=False)

In [74]:
encode_train_y = enc.fit_transform(train_y_T)

In [68]:
train_y_T


Out[68]:
array([[0],
       [1],
       [2],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9],
       [2],
       [8]])

In [75]:
encode_train_y


Out[75]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.]], dtype=float32)
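
Note the out-of-order execution counters: encode_train_y was produced by In [74], before the encoder cell shown above (In [76]) was last edited and re-run. The 9-column output is what the default n_values='auto' gives, since the digit 3 never occurs in train_y; re-running fit_transform with the n_values=10 encoder would produce a 10-column encoding with an all-zero column for 3.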

In [36]:
x[:,:size_of_x].get_value()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-82d6fdd42f26> in <module>()
----> 1 x[:,:size_of_x].get_value()

AttributeError: 'TensorVariable' object has no attribute 'get_value'
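
Slicing a shared variable returns a symbolic TensorVariable, which has no get_value(). A sketch of two ways to get the actual numbers instead:


In [ ]:
# Evaluate the symbolic slice directly...
x[:, :size_of_x].eval()
# ...or slice the numpy array held by the shared variable.
x.get_value()[:, :size_of_x]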

In [ ]:
# Per-example binary cross-entropy between x and reconstruction z
# (self.x in the original fragment is the shared variable x defined above).
cost = -T.mean(x[:, :size_of_x] * T.log(z[:, :size_of_x])
               + (1 - x[:, :size_of_x]) * T.log(1 - z[:, :size_of_x]), axis=1)

In [32]:
a = {}
a['s'] = 1
a['b'] = 2

In [33]:
a


Out[33]:
{'b': 2, 's': 1}

In [34]:
b = a.copy()
b['s'] = 3
b['c'] = 3
b


Out[34]:
{'b': 2, 'c': 3, 's': 3}

In [35]:
a


Out[35]:
{'b': 2, 's': 1}
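
dict.copy() is a shallow copy: the top-level mapping is independent (as the cells above show), but nested mutable values are still shared. A quick illustration:


In [ ]:
# A shallow copy shares nested mutables.
c = {'k': [1, 2]}
d = c.copy()
d['k'].append(3)
c['k']  # [1, 2, 3] -- the inner list is shared between c and d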

In [8]:
import pandas as pd
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects as ro

Running a simple command in R


In [5]:
pi = robjects.r('pi')
pi[0]


Out[5]:
3.141592653589793
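
robjects.r evaluates arbitrary R code and also exposes R objects by name. A minimal sketch of calling an R function directly from Python:


In [ ]:
# Look up R's sum() and call it on an R integer vector.
r_sum = robjects.r['sum']
r_sum(robjects.IntVector([1, 2, 3]))[0]  # 6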

Running code using the IPython R magic


In [10]:
%load_ext rmagic
%load_ext rpy2.ipython


The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic
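
(In newer IPython/rpy2 versions the rmagic extension is deprecated; %load_ext rpy2.ipython alone provides the %R and %%R magics.)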

Create data in R, compute in R, return result to Python

Run linear regression in R, print out a summary, and pass the result variable error back to Python:


In [11]:
%%R -o error
set.seed(10)
y<-c(1:1000)
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-(c(1:1000)*runif(1000,min=0,max=2))^2
x3<-log(c(1:1000)*runif(1000,min=0,max=2))

all_data<-data.frame(y,x1,x2,x3)
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))
training<- all_data[positions,]
testing<- all_data[-positions,]

lm_fit<-lm(y~x1+x2+x3,data=training)
print(summary(lm_fit))

predictions<-predict(lm_fit,newdata=testing)
error<-sqrt((sum((testing$y-predictions)^2))/nrow(testing))


Call:
lm(formula = y ~ x1 + x2 + x3, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-379.34 -125.71  -29.88   87.58  732.59 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.234e+01  2.495e+01  -2.098   0.0363 *  
x1           2.414e-01  1.589e-02  15.188   <2e-16 ***
x2           1.553e-04  9.767e-06  15.900   <2e-16 ***
x3           6.404e+01  4.827e+00  13.267   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 166.4 on 746 degrees of freedom
Multiple R-squared: 0.6613,	Adjusted R-squared: 0.6599 
F-statistic: 485.5 on 3 and 746 DF,  p-value: < 2.2e-16 


In [12]:
print error


[1] 169.8533
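
Depending on the rpy2 version, the -o flag delivers error as an R vector (hence the R-style [1] prefix above) or a numpy array; either way, indexing yields the scalar:


In [ ]:
float(error[0])  # ~169.85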

Create data in R, compute in Python

First we create the data in R:


In [13]:
%%R -o training,testing
set.seed(10)
y<-c(1:1000)
x1<-c(1:1000)*runif(1000,min=0,max=2)
x2<-(c(1:1000)*runif(1000,min=0,max=2))^2
x3<-log(c(1:1000)*runif(1000,min=0,max=2))

all_data<-data.frame(y,x1,x2,x3)
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))
training<- all_data[positions,]
testing<- all_data[-positions,]

The variables training and testing are now available as numpy arrays in the Python namespace thanks to the -o flag in the cell above. We'll build pandas DataFrames from them:


In [8]:
tr = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], training)))
te = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], testing)))

tr.head()


Out[8]:
           x1              x2        x3    y
0  724.861370    19728.318211  6.430894  614
1  103.074180      928.821687  5.132348  108
2  606.561051  1050676.686068  6.564257  518
3  862.674044    91504.275820  4.670171  879
4  393.014599     1134.679888  5.721699  379

Create linear regression model, print a summary:


In [9]:
from statsmodels.formula.api import ols

lm = ols('y ~ x1 + x2 + x3', tr).fit()
lm.summary()


Out[9]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.661
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     485.5
Date:                Sun, 05 May 2013   Prob (F-statistic):          7.53e-175
Time:                        12:06:08   Log-Likelihood:                -4898.0
No. Observations:                 750   AIC:                             9804.
Df Residuals:                     746   BIC:                             9823.
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -52.3400     24.950     -2.098      0.036      -101.321    -3.359
x1             0.2414      0.016     15.188      0.000         0.210     0.273
x2             0.0002   9.77e-06     15.900      0.000         0.000     0.000
x3            64.0431      4.827     13.267      0.000        54.567    73.520
==============================================================================
Omnibus:                       85.222   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              112.468
Skew:                           0.898   Prob(JB):                     3.78e-25
Kurtosis:                       3.609   Cond. No.                     3.41e+06
==============================================================================
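
The coefficients and R-squared match the R lm fit above, as expected: set.seed(10) makes both pipelines operate on identical training data.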

Predict and compute RMSE:


In [10]:
pred = lm.predict(te)

error = np.sqrt((sum((te.y - pred)**2)) / len(te))
error


Out[10]:
169.85333821453432
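
This matches the RMSE of 169.8533 computed entirely in R earlier.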

Create data in Python, compute in R

First we create data (numpy array) in Python:


In [11]:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])

We pass them into R using the -i flag, run linear regression in R, print a summary and the diagnostic plots, and send the result back to Python via -o:


In [12]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
par(mfrow=c(2,2))
plot(XYlm)


Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:  0.81,	Adjusted R-squared: 0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739 

We also get the model coefficients back from R as the variable XYcoef:


In [13]:
XYcoef


Out[13]:
array([ 3.2,  0.9])
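
A sketch of using the returned coefficients on the Python side, for example to compute fitted values for the original X:


In [ ]:
intercept, slope = XYcoef
intercept + slope * X  # array([ 3.2,  4.1,  5. ,  5.9,  6.8])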
