My data analysis workflow depends on R. For data wrangling I tend to use old Matlab code, run in Octave or via oct2py, or new Python code. I have moved to matplotlib and seaborn for all graphics, but I still depend on R for basic stats, multivariate analyses, and machine learning. There is so much in the R universe and, with the easy-to-use rpy2 library, there is no reason not to use R.
The %R magic is provided by rpy2, and it works really well for interactive data analysis or one-off calls to specialized libraries from CRAN. However, for more intensive analyses of data from multiple experiments, I ran into issues with memory management in rpy2 (documented here). My laptop PC does not have enough RAM (8 GB) to run through a batch of LFP files from, say, a dozen experiments (e.g., 12 × 16 = 192 channels, each typically with more than a million samples). It took a bit of work to figure out how to release and clean up memory between channels or files. I have tried to document my solutions to this issue in the post below.
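To make the pattern concrete before walking through it cell by cell, here is a minimal sketch of the per-file cleanup loop using rpy2.robjects directly; the file names and the analysis step are hypothetical placeholders, and the R-side cleanup is the same rm()/gc() pair demonstrated below.

import gc
import numpy as np
import rpy2.robjects as ro

ro.r('library(RcppCNPy)')
for fname in ['expt01.npy', 'expt02.npy']:    # hypothetical LFP files, one per experiment
    ADmat = np.load(fname)                     # channels x samples
    np.save('_current.npy', ADmat)             # hand the data to R via disk rather than %Rpush
    ro.r('ADmat <- npyLoad("_current.npy", type="numeric", dotranspose=FALSE)')
    # ... run the R analysis on ADmat here ...
    ro.r('rm(list=ls()); gc()')                # release R's copy before the next file
    del ADmat
    gc.collect()                               # release Python's copy too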
A Jupyter notebook for this post is available here.
In [1]:
import numpy as np, pandas as pd, feather
from scipy.io import loadmat, savemat
(I have found that the rpy2 extension only works on my PCs [Linux Mint 17 and Anaconda for Python 3.5] if I import the readline library before loading the rpy2 extension.)
In [2]:
import readline
%load_ext rpy2.ipython
In [3]:
# https://pypi.python.org/pypi/memory_profiler
from memory_profiler import memory_usage
mem_usage = memory_usage(-1, interval=1, timeout=1)  # -1 = report memory of the current (notebook) process
print(mem_usage)
Switch to the ~/temp folder for writing files. (Most of my PCs are backed up via Dropbox and SpiderOak, and I hate wasting bandwidth.)
In [4]:
%cd ~/temp
clean up from previous runs, as overwriting files takes longer than deleting them and writing fresh ones
In [5]:
%rm test*.*
In [6]:
ADmat = np.random.randn(32, 1500000) / 100  # 32 channels x 1.5 million float64 samples, about 366 MiB
In [7]:
whos ndarray
In [8]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
save the matrix for analysis in R
In [9]:
np.save('test.npy', ADmat)
In [10]:
%ls -lstr test.npy
In [11]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
RcppCNPy is a fantastic library for R that lets you read and write numpy data files.
In [12]:
%%R
library(RcppCNPy)
setwd("~/temp")
ADmat = npyLoad('test.npy', type="numeric", dotranspose=FALSE)
In [13]:
%R str(ADmat)
In [14]:
%R ls()
Out[14]:
In [15]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
remove ADmat to assess memory use with %Rpush below
In [16]:
%R rm(list=ls())
In [17]:
%R ls()
Out[17]:
In [18]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
ADmat is gone but memory is not released
In [19]:
%R gc(); # garbage collection
In [20]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
now the memory is released
In [21]:
%Rpush ADmat
htop (and some calculations in bc) reports an extra 340 to 360 MB (variable over runs) following %Rpush... why? That is roughly the size of ADmat itself (32 × 1,500,000 doubles ≈ 366 MiB), so it looks as if an additional copy of the array is being held somewhere during the conversion.
In [22]:
%R str(ADmat)
In [23]:
%R ls()
Out[23]:
In [24]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
%Rpush uses a lot more memory than if you save the file with numpy and load into R using the RcppCNPy library!
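If this comes up often, the save-then-npyLoad route can be wrapped in a small helper; this is just a sketch, and push_via_npy is a name I made up here, not part of rpy2.

import numpy as np
import rpy2.robjects as ro

def push_via_npy(arr, name, path='_push.npy'):
    # alternative to %Rpush: write the array with numpy, then let R read it with RcppCNPy
    np.save(path, arr)
    ro.r('library(RcppCNPy)')
    ro.r('{} <- npyLoad("{}", type="numeric", dotranspose=FALSE)'.format(name, path))

# usage: push_via_npy(ADmat, 'ADmat') instead of %Rpush ADmat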
In [25]:
%R rm(ADmat)
In [26]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
In [27]:
%R gc();
In [28]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
now the memory consumed by ADmat is released but the extra memory is still consumed
In [29]:
%R ls()
Out[29]:
ls() returns an empty array now that ADmat is gone
In [30]:
%R rm(list=ls())
In [31]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
In [32]:
%R gc();
In [33]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)
now we are back to where we started!
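For future runs, the R-side cleanup used throughout this post can be collapsed into one call; clear_r_workspace is just a convenience name for this sketch, not an rpy2 function.

import rpy2.robjects as ro

def clear_r_workspace():
    # remove every object from the embedded R session and force R's garbage collector,
    # mirroring the %R rm(list=ls()) and %R gc() cells above
    ro.r('rm(list=ls())')
    ro.r('gc()')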