My bicorr_hist_master matrix is enormous: it takes up 15 GB of space with 0.25 ns time binning. Most of it is empty, so I could instead convert it to a sparse matrix and store a much smaller matrix to file. Investigate this.
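As a rough sanity check on the 15 GB figure, here is a back-of-the-envelope estimate. The 990 detector pairs and 4 interaction types are the values quoted later in this notebook; the 1000 time bins per detector and the 4-byte count are assumptions made only for this sketch.

```python
# Back-of-the-envelope size of the dense four-dimensional histogram.
num_pairs = 990        # detector pairs (quoted later in this notebook)
num_types = 4          # nn, np, pn, pp (quoted later in this notebook)
num_dt_bins = 1000     # ASSUMPTION: about 1000 time bins per detector
bytes_per_count = 4    # ASSUMPTION: each count is stored in 4 bytes

num_elements = num_pairs * num_types * num_dt_bins ** 2
dense_bytes = num_elements * bytes_per_count

print(num_elements)        # 3960000000
print(dense_bytes / 1e9)   # 15.84 (GB), the same order as the 15 GB above
```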
Start by loading a bicorr_hist_master into memory for this study.
det_df: pandas dataframe for loading detector pair indices, angles
In [1]:
import numpy as np
import scipy.io as sio
import os
import sys
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import inspect
from tqdm import tqdm
In [4]:
sys.path.append('../scripts/')
In [5]:
import bicorr as bicorr
In [6]:
%load_ext autoreload
%autoreload 2
In [7]:
help(bicorr.load_bicorr)
Where are the files I want to import? Use the same file that I use in my analysis build_bicorr_hist_master.
In [10]:
os.listdir('../datar/1')
Out[10]:
I am going to provide the full bicorr_path as input.
In [12]:
bicorr_data = bicorr.load_bicorr(bicorr_path = '../datar/1/bicorr1_part')
Now I will build bicorr_hist_master.
In [14]:
det_df = bicorr.load_det_df()
dt_bin_edges, num_dt_bins = bicorr.build_dt_bin_edges()
In [15]:
bhm = bicorr.alloc_bhm(len(det_df),4,num_dt_bins)
In [17]:
bhm = bicorr.fill_bhm(bhm, bicorr_data, det_df, dt_bin_edges)
In [18]:
bhp = bicorr.build_bhp(bhm,dt_bin_edges)[0]
bicorr.bicorr_plot(bhp, dt_bin_edges, show_flag = True)
See documentation on Wikipedia: https://en.wikipedia.org/wiki/Sparse_matrix
A sparse matrix has most of its elements equal to zero. The number of zero-valued elements divided by the total number of elements is called its sparsity, and is equal to 1 minus the density of the matrix.
In [20]:
np.count_nonzero(bhm)
Out[20]:
In [21]:
bhm.size
Out[21]:
In [23]:
1-np.count_nonzero(bhm)/bhm.size
Out[23]:
Wow... My matrix is 98.8% sparse. I should definitely be using a sparse matrix to store the information.
scipy.sparse
SciPy has a 2-D sparse matrix package for numeric data. Documentation is available here: https://docs.scipy.org/doc/scipy/reference/sparse.html#usage-information. How do I use it? It will not be simple since my numpy array has four dimensions.
In [14]:
from scipy import sparse
I am following this StackOverflow for how to convert the array to a sparse matrix: http://stackoverflow.com/questions/7922487/how-to-transform-numpy-matrix-or-array-to-scipy-sparse-matrix
In [15]:
s_bicorr_hist_Master = sparse.csr_matrix(bicorr_hist_master[0,0,:,:])
In [16]:
print(s_bicorr_hist_Master)
In [17]:
type(s_bicorr_hist_Master)
Out[17]:
In [18]:
plt.pcolormesh(dt_bin_edges,dt_bin_edges,bicorr_hist_master[0,0,:,:],norm=matplotlib.colors.LogNorm())
plt.colorbar()
plt.show()
It looks like this scipy sparse matrix function will not suit my needs because it is limited to two-dimensional numpy arrays. I need to be able to store a four-dimensional array as a sparse matrix. I will try to write my own technique in the following section.
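To make the limitation concrete, here is a minimal sketch on a toy four-dimensional array (the array and its values are made up for illustration). scipy.sparse.csr_matrix handles one (pair, type) slice at a time, so covering the full array this way would need a separate sparse matrix for every pair and interaction type.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the 4-D histogram: (pair, type, det1 bin, det2 bin).
toy = np.zeros((2, 2, 6, 6), dtype=np.uint32)
toy[0, 0, 1, 4] = 5
toy[0, 0, 3, 3] = 2
toy[1, 1, 0, 5] = 9

# A single 2-D slice converts without trouble.
slice_2d = toy[0, 0, :, :]
s = sparse.csr_matrix(slice_2d)

print(s.nnz)             # 2 stored (non-zero) entries in this slice
restored = s.toarray()   # back to a dense 2-D array
```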
I am going to build a numpy array with a specified numpy data type (dtype). There will be five pieces of data for each element in the array:
pair_i: Detector pair, length = 990. Use np.uint16.
type_i: Interaction type, length 4 (0=nn, 1=np, 2=pn, 3=pp). Use np.uint8.
det1t_i: dt bin for detector 1 (up to 1000). Use np.uint16.
det2t_i: dt bin for detector 2 (up to 1000). Use np.uint16.
count: Value of that element. Use np.uint32.
First establish the formatting of each element in the array.
In [24]:
sparseType = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8), ('det1t_i', np.uint16), ('det2t_i', np.uint16), ('count', np.uint32)])
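A quick way to see what this dtype costs per stored element is to check its itemsize. The sketch below rebuilds the same dtype under the hypothetical name sparse_type and multiplies by the number of nonzero elements quoted in this notebook (2767589); the result is an estimate of the in-memory size only.

```python
import numpy as np

# Same field layout as sparseType above (the name sparse_type is illustrative).
sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# NumPy packs the fields without padding by default: 2 + 1 + 2 + 2 + 4 bytes.
bytes_per_record = sparse_type.itemsize
print(bytes_per_record)    # 11

num_nonzero = 2767589      # value quoted in this notebook
estimated_bytes = num_nonzero * bytes_per_record
print(estimated_bytes / 1e6)   # about 30.4 (MB)
```

That estimate is in line with the roughly 30 MB file reported later in this notebook.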
In [25]:
num_nonzero = np.count_nonzero(bhm)
In [26]:
print(num_nonzero)
I'm going to ask numpy for the indices of the non-zero values with np.nonzero. Numpy returns them as a tuple of arrays, one array per dimension, which is a somewhat strange organization. Explore how the indexing works.
First store the indices using np.nonzero. They come back as a tuple of four massive arrays of 2767589 elements each. They are so large that I can't print them to my screen. How do I extract the four arguments for the position of the $i^{th}$ nonzero value?
In [28]:
i_nonzero = np.nonzero(bhm)
In [29]:
counts = bhm[i_nonzero]
In [30]:
counts[600:800]
Out[30]:
How do I find the element at the top there that is equal to three? It is at the 637$^{th}$ index position in i_nonzero and counts. How do I find the corresponding indices in bicorr_hist_master?
In [31]:
i = 637
counts[637]
Out[31]:
In [32]:
counts[i]
Out[32]:
In [33]:
i_nonzero[0][i]
Out[33]:
In [34]:
i_nonzero[1][i]
Out[34]:
In [35]:
i_nonzero[2][i]
Out[35]:
In [36]:
i_nonzero[3][i]
Out[36]:
Can I call them all at once? It doesn't seem so...
In [37]:
i_nonzero[:][i]
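The slice i_nonzero[:] is just a copy of the four-element tuple, so indexing it with i asks for the i-th axis array rather than the i-th nonzero position, and fails for any i of 4 or more. A minimal sketch of two ways to get all four indices at once, on a toy array made up for illustration:

```python
import numpy as np

toy = np.zeros((3, 2, 4, 4), dtype=np.uint32)
toy[0, 1, 2, 3] = 7
toy[2, 0, 1, 1] = 3

i_nonzero = np.nonzero(toy)   # tuple of four 1-D index arrays
i = 1

# Option 1: take element i from each of the four axis arrays.
position = tuple(int(axis[i]) for axis in i_nonzero)
print(position)               # (2, 0, 1, 1)
print(toy[position])          # 3

# Option 2: np.argwhere returns one row of four indices per nonzero element.
positions = np.argwhere(toy)  # shape (num_nonzero, 4)
print(positions[i])           # [2 0 1 1]
```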
In [41]:
bhm[100,1,219,211]
Out[41]:
In [42]:
sparse_bhm = np.zeros(num_nonzero,dtype=sparseType)
In [45]:
for i in tqdm(np.arange(0,num_nonzero),ascii=True):
    sparse_bhm[i]['pair_i'] = i_nonzero[0][i]
    sparse_bhm[i]['type_i'] = i_nonzero[1][i]
    sparse_bhm[i]['det1t_i'] = i_nonzero[2][i]
    sparse_bhm[i]['det2t_i'] = i_nonzero[3][i]
    sparse_bhm[i]['count'] = counts[i]
print(sparse_bhm[0:20])
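The element-by-element loop above could also be written as whole-field assignments, which avoids the Python loop entirely. This is a sketch of that alternative on a toy array, not necessarily what bicorr.generate_sparse_bhm does; the names toy, sparse_type and sparse_toy are made up for illustration.

```python
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

toy = np.zeros((3, 2, 4, 4), dtype=np.uint32)
toy[0, 1, 2, 3] = 7
toy[2, 0, 1, 1] = 3

i_nonzero = np.nonzero(toy)
sparse_toy = np.zeros(len(i_nonzero[0]), dtype=sparse_type)

# Assign each field in one shot instead of looping over elements.
sparse_toy['pair_i'] = i_nonzero[0]
sparse_toy['type_i'] = i_nonzero[1]
sparse_toy['det1t_i'] = i_nonzero[2]
sparse_toy['det2t_i'] = i_nonzero[3]
sparse_toy['count'] = toy[i_nonzero]

print(sparse_toy)
```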
In [46]:
sparse_bhm[0]
Out[46]:
In [47]:
bhm[0,0,338,745]
Out[47]:
Functionalize this in my bicorr module.
In [48]:
print(inspect.getsource(bicorr.generate_sparse_bhm))
Start over and try generating sparse_bhm from bicorr_hist_master.
In [49]:
sparse_bhm = bicorr.generate_sparse_bhm(bhm)
In [50]:
sparse_bhm[0:20]
Out[50]:
Store sparse_bhm, dt_bin_edges to disk and reload
In order to make this useful, I need a clean and simple way of storing the sparse matrix to disk and reloading it. I used to save the following three variables to disk:
bicorr_hist_master
dict_pair_to_index
dt_bin_edges
Instead I will save the following three variables:
sparse_bhm
det_df
dt_bin_edges
Use the same np.savez technique and try it here:
In [43]:
np.savez('sparse_bhm', det_df = det_df, dt_bin_edges = dt_bin_edges, sparse_bhm=sparse_bhm)
This went much faster, as the bicorr_hist_master was 15 GB in size and this is only 30 MB in size. What a difference!
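For reference, here is a minimal round trip with np.savez and np.load on a toy structured array. The file name, the toy values and the temporary directory are all made up for illustration, and det_df is left out of the sketch for brevity.

```python
import os
import tempfile
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])
sparse_toy = np.array([(0, 1, 2, 3, 7), (2, 0, 1, 1, 3)], dtype=sparse_type)
dt_bin_edges_toy = np.arange(0.0, 1.25, 0.25)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sparse_toy.npz')
    # Each keyword becomes a named array inside the .npz archive.
    np.savez(path, sparse_bhm=sparse_toy, dt_bin_edges=dt_bin_edges_toy)
    # Read the arrays back by the same names.
    with np.load(path) as npz:
        loaded_sparse = npz['sparse_bhm']
        loaded_edges = npz['dt_bin_edges']

print(loaded_sparse.dtype == sparse_type)
```

The structured dtype survives the round trip, so the field names are still available after loading.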
Write this into a function, and provide an optional destination folder for saving.
In [51]:
help(bicorr.save_sparse_bhm)
In [53]:
bicorr.save_sparse_bhm(sparse_bhm, dt_bin_edges, '../datar')
In [54]:
sys.getsizeof(bhm)
Out[54]:
In [55]:
sys.getsizeof(sparse_bhm)
Out[55]:
In [58]:
sys.getsizeof(sparse_bhm)/sys.getsizeof(bhm)
Out[58]:
This is for a partial data set. In a larger data set (tested separately), I reduced my storage needs to 1.5% of the original size... hurrah! That is a significant change, and makes it possible to store the data on my local machine.
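A similar comparison can be made with the arrays' nbytes attribute, which counts only the data buffer. The sketch below uses a toy dense array and a hypothetical two-record sparse array, so the ratio is illustrative and does not reproduce the numbers above.

```python
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# Toy dense array with two nonzero elements.
dense_toy = np.zeros((3, 2, 4, 4), dtype=np.uint32)
dense_toy[0, 1, 2, 3] = 7
dense_toy[2, 0, 1, 1] = 3

# One 11-byte record per nonzero element.
sparse_toy = np.zeros(np.count_nonzero(dense_toy), dtype=sparse_type)

print(dense_toy.nbytes)    # 96 elements * 4 bytes = 384
print(sparse_toy.nbytes)   # 2 records * 11 bytes = 22
ratio = sparse_toy.nbytes / dense_toy.nbytes
print(ratio)
```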
In [49]:
os.listdir()
Out[49]:
In [50]:
help(bicorr.load_sparse_bhm)
In [9]:
os.listdir('subfolder')
Out[9]:
In [10]:
sparse_bhm, det_df, dt_bin_edges = bicorr.load_sparse_bhm(filepath = "subfolder")
In [12]:
who
In [13]:
sparse_bhm.size
Out[13]:
In [16]:
print(sparse_bhm[0:10])
bicorr_hist_master
The only dimension in the size of bicorr_hist_master that would change would be the number of time bins. Otherwise, the number of detector pairs and the number of interaction types I am recording will stay the same. Use the functions I have already developed to allocate the array for bicorr_hist_master.
In [21]:
bicorr_hist_master = bicorr.alloc_bhm(len(det_df),4,len(dt_bin_edges)-1)
Now I need to loop through sparse_bhm and fill bicorr_hist_master with the count at the corresponding index. What does sparse_bhm look like again?
In [19]:
sparse_bhm[0]
Out[19]:
Fill one element
In [20]:
i = 0
bicorr_hist_master[sparse_bhm[i][0],sparse_bhm[i][1],sparse_bhm[i][2],sparse_bhm[i][3]] = sparse_bhm[i][4]
In [21]:
print(bicorr_hist_master[sparse_bhm[i][0],sparse_bhm[i][1],sparse_bhm[i][2],sparse_bhm[i][3]])
Fill all of the elements
In [22]:
for i in tqdm(np.arange(0,sparse_bhm.size)):
    bicorr_hist_master[sparse_bhm[i][0],sparse_bhm[i][1],sparse_bhm[i][2],sparse_bhm[i][3]] = sparse_bhm[i][4]
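The same fill could be done in a single statement by indexing the dense array with the four index fields at once. This is a sketch of that alternative on toy data, not necessarily how bicorr.revive_sparse_bhm is written; all names here are made up for illustration.

```python
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# Toy dense array and its sparse representation.
original = np.zeros((3, 2, 4, 4), dtype=np.uint32)
original[0, 1, 2, 3] = 7
original[2, 0, 1, 1] = 3
idx = np.nonzero(original)
sparse_toy = np.zeros(len(idx[0]), dtype=sparse_type)
for field, axis in zip(('pair_i', 'type_i', 'det1t_i', 'det2t_i'), idx):
    sparse_toy[field] = axis
sparse_toy['count'] = original[idx]

# Revive: allocate zeros, then place every count at its four indices at once.
revived = np.zeros(original.shape, dtype=np.uint32)
revived[sparse_toy['pair_i'], sparse_toy['type_i'],
        sparse_toy['det1t_i'], sparse_toy['det2t_i']] = sparse_toy['count']

print(np.array_equal(original, revived))   # True
```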
In [23]:
np.max(bicorr_hist_master)
Out[23]:
Plot it to see if that looks correct.
In [25]:
bicorr_hist_plot = bicorr.build_bicorr_hist_plot(bicorr_hist_master,dt_bin_edges)[0]
bicorr.bicorr_plot(bicorr_hist_plot, dt_bin_edges, show_flag = True)
In [26]:
print(inspect.getsource(bicorr.revive_sparse_bhm))
In [11]:
sparse_bhm, det_df, dt_bin_edges = bicorr.load_sparse_bhm(filepath="subfolder")
In [12]:
bicorr_hist_master_sparse = bicorr.revive_sparse_bhm(sparse_bhm, det_df, dt_bin_edges)
In [13]:
np.array_equal(bicorr_hist_master,bicorr_hist_master_sparse)
Out[13]:
Tadaa!