Study of lossy floating point compression with fpack

Stephen Bailey, Spring 2015

Summary

  1. fpack implements the registered FITS tile compression convention, which compresses the data portion of HDUs while leaving the headers uncompressed. It supports lossless compression of integer data and either lossless or lossy compression of floating point data (see the sketch after this list).
  2. Lossless gzip compression of random floating point data isn't worth it: <5% compression.
  3. Converting double to float is a form of lossy compression: 2x compression with O(1e-8) loss of precision.
  4. fpack can do ~40x better in precision or ~10% better in compression than the float conversion, at a considerable cost in complexity (see the Note below).
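
For reference, a minimal sketch of the lossless integer case via astropy's implementation of the tile compression convention; the mask array and the blat-int.fits filename are placeholders for illustration, and clobber=True matches the astropy version used elsewhere in this notebook (newer versions use overwrite=True).

#- Minimal sketch (illustrative only): lossless compression of integer data,
#- e.g. a pixel mask, via the FITS tile compression convention in astropy
import numpy as np
from astropy.io import fits
mask = np.random.randint(0, 8, size=(500, 4000)).astype(np.int32)
fits.CompImageHDU(mask, compression_type='GZIP_1').writeto('blat-int.fits', clobber=True)
assert np.all(fits.getdata('blat-int.fits') == mask)   #- bit-exact round trip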

Recommendation for DESI

  1. Use fpack for lossless compression of integer data, e.g. raw data and masks
  2. Use float for lossy compression of double precision data
  3. Have I/O routines automatically convert to/from float (sketched below); all calculations should be done in double
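
As a rough sketch of recommendations 2 and 3 (write_flux and read_flux are hypothetical names, not existing DESI I/O routines): the I/O layer casts to float32 on write and promotes back to float64 on read, so that all downstream calculations stay in double.

#- Sketch of recommendations 2-3; write_flux/read_flux are hypothetical names
import numpy as np
from astropy.io import fits

def write_flux(filename, flux):
    #- lossy step: store double precision data as float32
    fits.writeto(filename, flux.astype(np.float32), clobber=True)

def read_flux(filename):
    #- promote back to float64 so downstream math is done in double
    return fits.getdata(filename).astype(np.float64)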

Note: Our pipeline is designed around each step writing a file and having the next step read that file to keep going. At the same time, we would like to reserve the possibility of skipping the read by directly using the results that are already in memory. The loss of precision from using float in the I/O instead of double is readily reproducible in memory; I don't know of a viable way to mimic the fpack loss of precision without actually writing the file.
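
For concreteness, the in-memory mimic of that float round trip is just a cast down and back up (the results array below is a stand-in for whatever a pipeline step already holds in memory):

#- Reproduce the float32 I/O precision loss without writing a file;
#- there is no equivalent in-memory shortcut for the fpack quantization loss
import numpy as np
results = np.random.normal(size=(500, 4000))            #- stand-in for in-memory results
results_as_if_read = results.astype(np.float32).astype(np.float64)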

float vs. double as a lossy compression


In [2]:
#- Loss of precision from converting double (float64) to float (float32)
import numpy as np
dd = np.random.normal(scale=1, loc=1, size=(500, 4000))
df = dd.astype(np.float32)
df_precision = np.std(dd-df)
print df_precision


3.58395842373e-08

fpack lossy compression


In [3]:
#- Compress a dataset; return compression factor and loss of precision
import os
from astropy.io import fits
def testpack(data, compression_type='RICE_1', quantize_level=16.0):
    hdu = fits.CompImageHDU(data, compression_type=compression_type,
                            quantize_level=quantize_level)
    hdu.writeto('blat.fits', clobber=True)
    dx = fits.getdata('blat.fits')
    #- compare compressed file size to uncompressed in-memory size, both in MB
    mbx = os.stat('blat.fits').st_size / 1024.**2
    compression = (data.size*data.dtype.itemsize/1024.**2) / mbx
    return compression, np.std(data-dx)

In [7]:
print 'quantize=16', testpack(dd, 'GZIP_1', 16)
print 'quantize=10**9', testpack(dd, 'GZIP_1', 10**9)


quantize=16 (5.922767116796967, 0.017874087328688743)
quantize=10**9 (1.0376457892333872, 0.0)

In [6]:
#- Also test with DESI spectral data instead of pure random data
#- (hardwired filepath, not portable)
infile = '/data/desi/spectro/redux/sjb/dogwood/exposures/20150211/00000002/cframe-r0-00000002.fits'
spectra = fits.getdata(infile, 'FLUX')
print 'quantize=16', testpack(spectra, 'GZIP_1', 16)
print 'quantize=10**9', testpack(spectra, 'GZIP_1', 10**9)


quantize=16 (5.508260814730735, 0.011958782299361802)
quantize=10**9 (1.0398066804752042, 0.0)

In [32]:
#- Scan through the quantization parameter
#- Return arrays of q, compression_factor, loss_of_precision
def scanq(data, compression_type='GZIP_1'):
    qx = 10**np.arange(9, 4, -0.5)
    results = [testpack(data, compression_type=compression_type, quantize_level=q) for q in qx]
    compression, precision = zip(*results)
    return qx, np.asarray(compression), np.asarray(precision)

In [44]:
%pylab inline
#- Plot loss of precision vs. compression factor over a scan of quantize_level
#- extra_noise: precision already lost before compression (e.g. the double -> float cast)
#- pre_compression: compression factor already gained before fpack (e.g. 2x for float)
def plotscan(data, comptype='GZIP_1', extra_noise=0.0, pre_compression=1.0):
    qx, compression, precision = scanq(data, comptype)
    precision = np.sqrt(precision**2 + extra_noise**2)
    plot(compression*pre_compression, np.log10(precision), 'b.-')
    xlim(1, np.max(compression*pre_compression)*1.2)
    xlabel('compression factor')
    ylabel('log10(std(delta))')
    return qx, compression, precision


Populating the interactive namespace from numpy and matplotlib

In [45]:
qx, compression, precision = plotscan(dd, 'GZIP_1')
plot([2,], [np.log10(df_precision),], 'rs')   #- red square: plain float32 conversion (2x)


Out[45]:
[<matplotlib.lines.Line2D at 0x105b4b5d0>]

In [46]:
print 'Lossless compression', np.max(compression[precision==0.0])
px = np.std(dd-df)
print 'Precision improvement vs. float', px / np.interp(2, compression, precision)
print 'Compression improvement vs. float', np.interp(px, precision, compression)/2.0


Lossless compression 1.03764578923
Precision improvement vs. float 39.9168379234
Compression improvement vs. float 1.11205527366

In [47]:
#- comparing compression of double vs. float
qx, compression, precision = plotscan(dd, 'GZIP_1')
qx, compression, precision = plotscan(df, 'GZIP_1', extra_noise=df_precision, pre_compression=2.0)