My data analysis workflow depends on R. For data wrangling I tend to use old Matlab code, run in Octave or via oct2py, or new Python code. I have moved to matplotlib and seaborn for all graphics, but I still depend on R for basic stats, multivariate analyses, and machine learning. There is so much in the R universe and, with the easy-to-use rpy2 library, there is no reason not to use R.
The %R magic is provided by rpy2, and it works really well for interactive data analysis or one-off calls to specialized libraries from CRAN. However, for more intensive analyses of data from multiple experiments, I ran into issues with memory management in rpy2 (documented here). My laptop PC does not have enough RAM (8 GB) to run through a batch of LFP files from, say, a dozen experiments (e.g., 12 × 16 = 192 channels, each typically with more than a million samples). It took a bit of work to figure out how to release and clean up memory between channels or files. I have tried to document my solutions to this issue in the post below.
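To make the pattern concrete before walking through it cell by cell, here is a minimal sketch of the per-file cleanup loop using rpy2.robjects directly; the file names and the analysis step are hypothetical placeholders, and the R-side cleanup is the same rm()/gc() pair demonstrated below.

import gc
import numpy as np
import rpy2.robjects as ro

ro.r('library(RcppCNPy)')
for fname in ['expt01.npy', 'expt02.npy']:    # hypothetical LFP files, one per experiment
    ADmat = np.load(fname)                     # channels x samples
    np.save('_current.npy', ADmat)             # hand the data to R via disk rather than %Rpush
    ro.r('ADmat <- npyLoad("_current.npy", type="numeric", dotranspose=FALSE)')
    # ... run the R analysis on ADmat here ...
    ro.r('rm(list=ls()); gc()')                # release R's copy before the next file
    del ADmat
    gc.collect()                               # release Python's copy too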
A Jupyter notebook for this post is available here.
In [1]:
import numpy as np, pandas as pd, feather
from scipy.io import loadmat, savemat
(I have found that the rpy2 extension only works on my PCs [Linux Mint 17 and Anaconda for Python 3.5] if I import the readline library before loading the rpy2 extension.)
In [2]:
import readline
%load_ext rpy2.ipython
In [3]:
# https://pypi.python.org/pypi/memory_profiler
from memory_profiler import memory_usage
mem_usage = memory_usage(-1, interval=1, timeout=1)  # -1 = report memory of the current (notebook) process
print(mem_usage)
Switch to the ~/temp folder for writing files. (Most of my PCs are backed up via Dropbox and SpiderOak, and I hate wasting bandwidth.)
In [4]:
%cd ~/temp
clean up from previous runs, as overwriting files takes longer than deleting them and writing fresh ones
In [5]:
%rm test*.*
In [6]:
ADmat = np.random.randn(32, 1500000) / 100  # 32 channels x 1.5 million float64 samples, about 366 MiB
In [7]:
whos ndarray
In [8]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
save the matrix for analysis in R
In [9]:
np.save('test.npy', ADmat)
In [10]:
%ls -lstr test.npy
In [11]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
RcppCNPy is a fantastic library for R that lets you read and write numpy data files.
In [12]:
%%R
library(RcppCNPy)
setwd("~/temp")
ADmat = npyLoad('test.npy', type="numeric", dotranspose=FALSE)
In [13]:
%R str(ADmat)
In [14]:
%R ls()
Out[14]:
In [15]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
remove ADmat to assess memory use with %Rpush below
In [16]:
%R rm(list=ls())
In [17]:
%R ls()
Out[17]:
In [18]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
ADmat is gone but memory is not released
In [19]:
%R gc(); # garbage collection
In [20]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
now the memory is released
In [21]:
%Rpush ADmat
htop (and some calculations in bc) reports an extra 340 to 360 MB (variable over runs) following %Rpush... why? That is roughly the size of ADmat itself (32 × 1,500,000 doubles ≈ 366 MiB), so it looks as if an additional copy of the array is being held somewhere during the conversion.
In [22]:
%R str(ADmat)
In [23]:
%R ls()
Out[23]:
In [24]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
%Rpush uses a lot more memory than if you save the file with numpy and load into R using the RcppCNPy library!
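If this comes up often, the save-then-npyLoad route can be wrapped in a small helper; this is just a sketch, and push_via_npy is a name I made up here, not part of rpy2.

import numpy as np
import rpy2.robjects as ro

def push_via_npy(arr, name, path='_push.npy'):
    # alternative to %Rpush: write the array with numpy, then let R read it with RcppCNPy
    np.save(path, arr)
    ro.r('library(RcppCNPy)')
    ro.r('{} <- npyLoad("{}", type="numeric", dotranspose=FALSE)'.format(name, path))

# usage: push_via_npy(ADmat, 'ADmat') instead of %Rpush ADmat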
In [25]:
%R rm(ADmat)
In [26]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
In [27]:
%R gc();
In [28]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
now the memory consumed by ADmat is released but the extra memory is still consumed
In [29]:
%R ls()
Out[29]:
ls() returns an empty array now that ADmat is gone
In [30]:
%R rm(list=ls())
In [31]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
In [32]:
%R gc();
In [33]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
now we are back to where we started!
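For future runs, the R-side cleanup used throughout this post can be collapsed into one call; clear_r_workspace is just a convenience name for this sketch, not an rpy2 function.

import rpy2.robjects as ro

def clear_r_workspace():
    # remove every object from the embedded R session and force R's garbage collector,
    # mirroring the %R rm(list=ls()) and %R gc() cells above
    ro.r('rm(list=ls())')
    ro.r('gc()')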