My bicorr_hist_master matrix is enormous: it takes up 15 GB with 0.25 ns time binning. Most of it is empty, so I could instead convert it to a sparse matrix and store a much smaller array to file. Investigate this.
Start by loading a bicorr_hist_master into memory for this study.
I will also need det_df, the pandas DataFrame that holds the detector pair indices and angles.
In [1]:
    
import numpy as np
import scipy.io as sio
import os
import sys
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import inspect
from tqdm import tqdm
    
In [4]:
    
sys.path.append('../scripts/')
    
In [5]:
    
import bicorr as bicorr
    
In [6]:
    
%load_ext autoreload
%autoreload 2
    
In [7]:
    
help(bicorr.load_bicorr)
    
    
Where are the files I want to import? Use the same file that I use in my analysis build_bicorr_hist_master.
In [10]:
    
os.listdir('../datar/1')
    
    Out[10]:
I am going to provide the full bicorr_path as input.
In [12]:
    
bicorr_data = bicorr.load_bicorr(bicorr_path = '../datar/1/bicorr1_part')
    
Now I will build bicorr_hist_master.
In [14]:
    
det_df = bicorr.load_det_df()
dt_bin_edges, num_dt_bins = bicorr.build_dt_bin_edges()
    
In [15]:
    
bhm = bicorr.alloc_bhm(len(det_df),4,num_dt_bins)
    
In [17]:
    
bhm = bicorr.fill_bhm(bhm, bicorr_data, det_df, dt_bin_edges)
    
    
In [18]:
    
bhp = bicorr.build_bhp(bhm,dt_bin_edges)[0]
bicorr.bicorr_plot(bhp, dt_bin_edges, show_flag = True)
    
    
See documentation on Wikipedia: https://en.wikipedia.org/wiki/Sparse_matrix
A sparse matrix has most of its elements equal to zero. The number of zero-valued elements divided by the total number of elements is called its sparsity, and is equal to 1 minus the density of the matrix.
In [20]:
    
np.count_nonzero(bhm)
    
    Out[20]:
In [21]:
    
bhm.size
    
    Out[21]:
In [23]:
    
1-np.count_nonzero(bhm)/bhm.size
    
    Out[23]:
Wow... My matrix is 98.8% sparse. I should definitely be using a sparse matrix to store the information.
scipy.sparse
SciPy has a 2-D sparse matrix package for numeric data. Documentation is available here: https://docs.scipy.org/doc/scipy/reference/sparse.html#usage-information. How do I use it? It will not be simple since my numpy array has four dimensions.
In [14]:
    
from scipy import sparse
    
I am following this StackOverflow for how to convert the array to a sparse matrix: http://stackoverflow.com/questions/7922487/how-to-transform-numpy-matrix-or-array-to-scipy-sparse-matrix
In [15]:
    
s_bicorr_hist_Master = sparse.csr_matrix(bhm[0,0,:,:])
    
In [16]:
    
print(s_bicorr_hist_Master)
    
    
In [17]:
    
type(s_bicorr_hist_Master)
    
    Out[17]:
In [18]:
    
plt.pcolormesh(dt_bin_edges,dt_bin_edges,bhm[0,0,:,:],norm=matplotlib.colors.LogNorm())
plt.colorbar()
plt.show()
    
    
It looks like this scipy sparse matrix function will not suit my needs because it is limited to two-dimensional numpy arrays. I need to be able to store a four-dimensional array as a sparse matrix. I will try to write my own technique in the following section.
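For reference, one workaround that I am not pursuing here is to collapse the leading axes so the array becomes two-dimensional before handing it to scipy.sparse. This is only a sketch on a tiny stand-in array; the shape and the values are made up, and the original shape has to be stored separately to undo the reshape.

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for the 4-D histogram: (pairs, types, dt bins, dt bins).
# Shape and values are illustrative only.
dense = np.zeros((3, 4, 5, 5), dtype=np.int64)
dense[0, 0, 1, 2] = 4
dense[2, 3, 4, 0] = 1

# Collapse the first three axes into rows so the array is 2-D
flat = dense.reshape(-1, dense.shape[-1])
s_flat = sparse.csr_matrix(flat)

# Convert back to dense and undo the reshape
restored = s_flat.toarray().reshape(dense.shape)
print(s_flat.shape, s_flat.nnz, np.array_equal(dense, restored))
```

The drawback is that the four indices are no longer stored explicitly, which is part of why I prefer the record-based layout in the next section.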
I am going to build a numpy array with a specified numpy data type (dtype). There will be five pieces of data for each element in the array:
pair_i: Detector pair, length = 990. Use np.uint16.
type_i: Interaction type, length 4 (0=nn, 1=np, 2=pn, 3=pp). Use np.uint8.
det1t_i: dt bin for detector 1 (up to 1000). Use np.uint16.
det2t_i: dt bin for detector 2 (up to 1000). Use np.uint16.
count: Value of that element. Use np.uint32.
First establish the formatting of each element in the array.
In [24]:
    
sparseType = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8), ('det1t_i', np.uint16), ('det2t_i', np.uint16), ('count', np.uint32)])
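To get a feel for how a structured dtype like this behaves, here is a small sketch. The name sparse_type and the record values are made up for illustration; only the field layout is copied from the cell above.

```python
import numpy as np

# Same field layout as sparseType above
sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

demo = np.zeros(3, dtype=sparse_type)   # three empty records
demo[0] = (100, 1, 219, 211, 3)         # fill a whole record at once (made-up values)
demo['count'][1] = 7                    # or address a single field by name

print(sparse_type.itemsize)             # bytes used by each record
print(demo[0]['pair_i'], demo['count'])
```

Each record is addressed by position and each field by name, so demo['count'] gives all of the counts as one array.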
    
In [25]:
    
num_nonzero = np.count_nonzero(bhm)
    
In [26]:
    
print(num_nonzero)
    
    
I'm going to ask numpy for the indices of the non-zero values with np.nonzero. Numpy returns them as a tuple of index arrays, one array per dimension, which is organized somewhat strangely. Explore how the indexing works.
First store the indices using np.nonzero. They come back as four massive arrays of 2767589 elements each. They are so large that I can't print them to my screen. How do I extract the four indices for the position of the $i^{th}$ nonzero value?
In [28]:
    
i_nonzero = np.nonzero(bhm)
    
In [29]:
    
counts = bhm[i_nonzero]
    
In [30]:
    
counts[600:800]
    
    Out[30]:
How do I find the element at the top there that is equal to three? It is at the 637$^{th}$ index position in i_nonzero and counts. How do I find the corresponding indices in bicorr_hist_master?
In [31]:
    
i = 637
counts[637]
    
    Out[31]:
In [32]:
    
counts[i]
    
    Out[32]:
In [33]:
    
i_nonzero[0][i]
    
    Out[33]:
In [34]:
    
i_nonzero[1][i]
    
    Out[34]:
In [35]:
    
i_nonzero[2][i]
    
    Out[35]:
In [36]:
    
i_nonzero[3][i]
    
    Out[36]:
Can I call them all at once? It doesn't seem so...
In [37]:
    
i_nonzero[:][i]
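The slice i_nonzero[:] is just a copy of the four-element tuple, so indexing it with i looks for the i-th array rather than the i-th entry of each array. A minimal sketch of two ways to pull out all four indices at once, on a tiny made-up array (the names a, idx, position and table are mine):

```python
import numpy as np

# Tiny 4-D array with two nonzero entries (illustrative values)
a = np.zeros((2, 3, 4, 4), dtype=np.uint32)
a[0, 1, 2, 3] = 5
a[1, 2, 0, 1] = 3

idx = np.nonzero(a)          # tuple of four index arrays, one per axis

# Option 1: take entry i from each of the four arrays
i = 1
position = tuple(int(axis[i]) for axis in idx)
print(position, a[position])

# Option 2: stack the arrays so that each row holds one full index
table = np.transpose(idx)    # shape (number of nonzeros, 4)
print(table[i])
```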
    
    
In [41]:
    
bhm[100,1,219,211]
    
    Out[41]:
In [42]:
    
sparse_bhm = np.zeros(num_nonzero,dtype=sparseType)
    
In [45]:
    
for i in tqdm(np.arange(0,num_nonzero),ascii=True):
    sparse_bhm[i]['pair_i']  = i_nonzero[0][i]
    sparse_bhm[i]['type_i']  = i_nonzero[1][i]
    sparse_bhm[i]['det1t_i'] = i_nonzero[2][i]
    sparse_bhm[i]['det2t_i'] = i_nonzero[3][i]
    sparse_bhm[i]['count']   = counts[i]
              
print(sparse_bhm[0:20])
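The element-by-element loop works, but each field can also be assigned as a whole array, which avoids the Python loop. This is a sketch on a small made-up array and not necessarily how bicorr.generate_sparse_bhm is written; sparse_type, dense and sparse_demo are illustrative names.

```python
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# Small stand-in for bhm with three nonzero entries (made-up values)
dense = np.zeros((3, 4, 6, 6), dtype=np.uint32)
dense[0, 0, 1, 2] = 4
dense[2, 3, 5, 0] = 1
dense[1, 1, 3, 3] = 9

idx = np.nonzero(dense)
sparse_demo = np.zeros(len(idx[0]), dtype=sparse_type)

# Assign each index array to its field in one step
for name, axis in zip(('pair_i', 'type_i', 'det1t_i', 'det2t_i'), idx):
    sparse_demo[name] = axis
sparse_demo['count'] = dense[idx]

print(sparse_demo)
```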
    
    
    
In [46]:
    
sparse_bhm[0]
    
    Out[46]:
In [47]:
    
bhm[0,0,338,745]
    
    Out[47]:
Functionalize this in my bicorr module.
In [48]:
    
print(inspect.getsource(bicorr.generate_sparse_bhm))
    
    
Start over and try generating sparse_bhm from bicorr_hist_master.
In [49]:
    
sparse_bhm = bicorr.generate_sparse_bhm(bhm)
    
    
In [50]:
    
sparse_bhm[0:20]
    
    Out[50]:
Store sparse_bhm, dt_bin_edges to disk and reload
In order to make this useful, I need a clean and simple way of storing the sparse matrix to disk and reloading it. I used to save the following three variables to disk:
bicorr_hist_master
dict_pair_to_index
dt_bin_edges
Instead I will save the following three variables:
sparse_bhm
det_df
dt_bin_edges
Use the same np.savez technique and try it here:
In [43]:
    
np.savez('sparse_bhm', det_df = det_df, dt_bin_edges = dt_bin_edges, sparse_bhm=sparse_bhm)
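As a quick check that a structured array survives the trip through an .npz file, here is a minimal round-trip sketch. It writes to a temporary folder, leaves out det_df to stay free of pandas, and uses made-up records and bin edges; the file name sparse_demo.npz is hypothetical.

```python
import os
import tempfile
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# Two made-up records and illustrative dt bin edges
sparse_demo = np.array([(0, 0, 1, 2, 4), (1, 1, 3, 3, 9)], dtype=sparse_type)
edges = np.arange(-50, 50.25, 0.25)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sparse_demo.npz')
    np.savez(path, sparse_bhm=sparse_demo, dt_bin_edges=edges)
    # Read the arrays back before the temporary folder is removed
    with np.load(path) as npz:
        loaded_sparse = npz['sparse_bhm']
        loaded_edges = npz['dt_bin_edges']

print(loaded_sparse.dtype == sparse_demo.dtype)
print(loaded_sparse)
```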
    
This went much faster, as the bicorr_hist_master was 15 GB in size and this is only 30 MB in size. What a difference!
Write this into a function, and provide an optional destination folder for saving.
In [51]:
    
help(bicorr.save_sparse_bhm)
    
    
In [53]:
    
bicorr.save_sparse_bhm(sparse_bhm, dt_bin_edges, '../datar')
    
In [54]:
    
sys.getsizeof(bhm)
    
    Out[54]:
In [55]:
    
sys.getsizeof(sparse_bhm)
    
    Out[55]:
In [58]:
    
sys.getsizeof(sparse_bhm)/sys.getsizeof(bhm)
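sys.getsizeof measures the Python object, and as far as I know it may not include the data buffer when the array is a view of another array. The nbytes attribute reports the size of the array data directly, so it is a useful cross-check. A sketch with made-up shapes, assuming a uint32 dense array (I have not checked which dtype bicorr.alloc_bhm uses):

```python
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# Made-up sizes: a small dense array and two sparse records
dense = np.zeros((10, 4, 50, 50), dtype=np.uint32)
sparse_demo = np.zeros(2, dtype=sparse_type)

# nbytes = number of elements * bytes per element
print(dense.nbytes, sparse_demo.nbytes)
print(sparse_demo.nbytes / dense.nbytes)
```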
    
    Out[58]:
This is for a partial data set. In a larger data set (tested separately), I reduced my storage needs to 1.5% of the original size... hurrah! That is a significant change, and makes it possible to store the data on my local machine.
In [49]:
    
os.listdir()
    
    Out[49]:
In [50]:
    
help(bicorr.load_sparse_bhm)
    
    
In [9]:
    
os.listdir('subfolder')
    
    Out[9]:
In [10]:
    
sparse_bhm, det_df, dt_bin_edges = bicorr.load_sparse_bhm(filepath = "subfolder")
    
In [12]:
    
who
    
    
In [13]:
    
sparse_bhm.size
    
    Out[13]:
In [16]:
    
print(sparse_bhm[0:10])
    
    
Reconstruct bicorr_hist_master
The only dimension of bicorr_hist_master that could change is the number of time bins. The number of detector pairs and the number of interaction types I am recording will stay the same. Use the functions I have already developed to allocate the array for bicorr_hist_master.
In [21]:
    
bicorr_hist_master = bicorr.alloc_bhm(len(det_df),4,len(dt_bin_edges)-1)
    
Now I need to loop through sparse_bhm and fill bicorr_hist_master with the count at the corresponding index. What does sparse_bhm look like again?
In [19]:
    
sparse_bhm[0]
    
    Out[19]:
Fill one element
In [20]:
    
i = 0
bicorr_hist_master[sparse_bhm[i][0],sparse_bhm[i][1],sparse_bhm[i][2],sparse_bhm[i][3]] = sparse_bhm[i][4]
    
In [21]:
    
print(bicorr_hist_master[sparse_bhm[i][0],sparse_bhm[i][1],sparse_bhm[i][2],sparse_bhm[i][3]])
    
    
Fill all of the elements
In [22]:
    
for i in tqdm(np.arange(0,sparse_bhm.size)):
    bicorr_hist_master[sparse_bhm[i][0],sparse_bhm[i][1],sparse_bhm[i][2],sparse_bhm[i][3]] = sparse_bhm[i][4]
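The same fill can be written without a Python loop by indexing with the four index fields at once. This is a sketch on made-up records and not necessarily how bicorr.revive_sparse_bhm is implemented; the names and the dense shape are illustrative.

```python
import numpy as np

sparse_type = np.dtype([('pair_i', np.uint16), ('type_i', np.uint8),
                        ('det1t_i', np.uint16), ('det2t_i', np.uint16),
                        ('count', np.uint32)])

# Two made-up sparse records
sparse_demo = np.array([(0, 0, 1, 2, 4), (2, 3, 5, 0, 1)], dtype=sparse_type)

# Empty dense array with an illustrative shape
dense = np.zeros((3, 4, 6, 6), dtype=np.uint32)

# Each field is an array of indices along one axis, so one assignment
# places every count at its (pair, type, det1 bin, det2 bin) position
dense[sparse_demo['pair_i'], sparse_demo['type_i'],
      sparse_demo['det1t_i'], sparse_demo['det2t_i']] = sparse_demo['count']

print(dense[0, 0, 1, 2], dense[2, 3, 5, 0], np.count_nonzero(dense))
```

This relies on every index combination appearing only once in the sparse array, which holds when the records come from np.nonzero.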
    
    
In [23]:
    
np.max(bicorr_hist_master)
    
    Out[23]:
Plot it to see if that looks correct.
In [25]:
    
bicorr_hist_plot = bicorr.build_bhp(bicorr_hist_master,dt_bin_edges)[0]
bicorr.bicorr_plot(bicorr_hist_plot, dt_bin_edges, show_flag = True)
    
    
In [26]:
    
print(inspect.getsource(bicorr.revive_sparse_bhm))
    
    
In [11]:
    
sparse_bhm, det_df, dt_bin_edges = bicorr.load_sparse_bhm(filepath="subfolder")
    
In [12]:
    
bicorr_hist_master_sparse = bicorr.revive_sparse_bhm(sparse_bhm, det_df, dt_bin_edges)
    
In [13]:
    
np.array_equal(bicorr_hist_master,bicorr_hist_master_sparse)
    
    Out[13]:
Tadaa!