Study of lossy floating point compression with fpack

Stephen Bailey, Spring 2015

Summary

  1. fpack implements the registered FITS tile compression convention, which compresses the data portion of HDUs while leaving the headers uncompressed. It supports lossless compression of integer data and either lossless or lossy compression of floating point data (see the sketch after this list).
  2. Lossless gzip compression of random floating point data isn't worth it: <5% compression.
  3. Converting double to float is a form of lossy compression: 2x compression with O(1e-8) loss of precision.
  4. fpack can do ~40x better in precision or ~10% better in compression than the float conversion, at a considerable cost in complexity (see the Note below).
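
For reference, a minimal sketch of the lossless integer case via astropy's implementation of the tile compression convention; the mask array and the blat-int.fits filename are placeholders for illustration, and clobber=True matches the astropy version used elsewhere in this notebook (newer versions use overwrite=True).

#- Minimal sketch (illustrative only): lossless compression of integer data,
#- e.g. a pixel mask, via the FITS tile compression convention in astropy
import numpy as np
from astropy.io import fits
mask = np.random.randint(0, 8, size=(500, 4000)).astype(np.int32)
fits.CompImageHDU(mask, compression_type='GZIP_1').writeto('blat-int.fits', clobber=True)
assert np.all(fits.getdata('blat-int.fits') == mask)   #- bit-exact round trip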

Recommendation for DESI

  1. Use fpack for lossless compression of integer data, e.g. raw data and masks
  2. Use float for lossy compression of double precision data
  3. Have I/O routines automatically convert to/from float (sketched below); all calculations should be done in double
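
As a rough sketch of recommendations 2 and 3 (write_flux and read_flux are hypothetical names, not existing DESI I/O routines): the I/O layer casts to float32 on write and promotes back to float64 on read, so that all downstream calculations stay in double.

#- Sketch of recommendations 2-3; write_flux/read_flux are hypothetical names
import numpy as np
from astropy.io import fits

def write_flux(filename, flux):
    #- lossy step: store double precision data as float32
    fits.writeto(filename, flux.astype(np.float32), clobber=True)

def read_flux(filename):
    #- promote back to float64 so downstream math is done in double
    return fits.getdata(filename).astype(np.float64)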

Note: Our pipeline is designed around each step writing a file and having the next step read that file to keep going. At the same time, we would like to reserve the possibility of skipping the read by directly using the results that are already in memory. The loss of precision from using float in the I/O instead of double is readily reproducible in memory; I don't know of a viable way to mimic the fpack loss of precision without actually writing the file.
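
For concreteness, the in-memory mimic of that float round trip is just a cast down and back up (the results array below is a stand-in for whatever a pipeline step already holds in memory):

#- Reproduce the float32 I/O precision loss without writing a file;
#- there is no equivalent in-memory shortcut for the fpack quantization loss
import numpy as np
results = np.random.normal(size=(500, 4000))            #- stand-in for in-memory results
results_as_if_read = results.astype(np.float32).astype(np.float64)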

float vs. double as a lossy compression


In [2]:
#- Loss of precision from converting double (float64) to float (float32)
import numpy as np
dd = np.random.normal(scale=1, loc=1, size=(500, 4000))
df = dd.astype(np.float32)
df_precision = np.std(dd-df)
print df_precision


3.58395842373e-08

fpack lossy compression


In [3]:
#- Compress a dataset; return compression factor and loss of precision
import os
from astropy.io import fits
def testpack(data, compression_type='RICE_1', quantize_level=16.0):
    hdu = fits.CompImageHDU(data, compression_type=compression_type,
                            quantize_level=quantize_level)
    hdu.writeto('blat.fits', clobber=True)
    dx = fits.getdata('blat.fits')
    #- compare compressed file size to uncompressed in-memory size, both in MB
    mbx = os.stat('blat.fits').st_size / 1024.**2
    compression = (data.size*data.dtype.itemsize/1024.**2) / mbx
    return compression, np.std(data-dx)

In [7]:
print 'quantize=16', testpack(dd, 'GZIP_1', 16)
print 'quantize=10**9', testpack(dd, 'GZIP_1', 10**9)


quantize=16 (5.922767116796967, 0.017874087328688743)
quantize=10**9 (1.0376457892333872, 0.0)

In [6]:
#- Also test with DESI spectral data instead of pure random data
#- (hardwired filepath, not portable)
infile = '/data/desi/spectro/redux/sjb/dogwood/exposures/20150211/00000002/cframe-r0-00000002.fits'
spectra = fits.getdata(infile, 'FLUX')
print 'quantize=16', testpack(spectra, 'GZIP_1', 16)
print 'quantize=10**9', testpack(spectra, 'GZIP_1', 10**9)


quantize=16 (5.508260814730735, 0.011958782299361802)
quantize=10**9 (1.0398066804752042, 0.0)

In [32]:
#- Scan through the quantization parameter
#- Return arrays of q, compression_factor, loss_of_precision
def scanq(data, compression_type='GZIP_1'):
    qx = 10**np.arange(9, 4, -0.5)
    results = [testpack(data, compression_type=compression_type, quantize_level=q) for q in qx]
    compression, precision = zip(*results)
    return qx, np.asarray(compression), np.asarray(precision)

In [44]:
%pylab inline
#- Plot loss of precision vs. compression factor over a scan of quantize_level
#- extra_noise: precision already lost before compression (e.g. the double -> float cast)
#- pre_compression: compression factor already gained before fpack (e.g. 2x for float)
def plotscan(data, comptype='GZIP_1', extra_noise=0.0, pre_compression=1.0):
    qx, compression, precision = scanq(data, comptype)
    precision = np.sqrt(precision**2 + extra_noise**2)
    plot(compression*pre_compression, np.log10(precision), 'b.-')
    xlim(1, np.max(compression*pre_compression)*1.2)
    xlabel('compression factor')
    ylabel('log10(std(delta))')
    return qx, compression, precision


Populating the interactive namespace from numpy and matplotlib

In [45]:
qx, compression, precision = plotscan(dd, 'GZIP_1')
plot([2,], [np.log10(df_precision),], 'rs')   #- red square: plain float32 conversion (2x)


Out[45]:
[<matplotlib.lines.Line2D at 0x105b4b5d0>]

In [46]:
print 'Lossless compression', np.max(compression[precision==0.0])
px = np.std(dd-df)
print 'Precision improvement vs. float', px / np.interp(2, compression, precision)
print 'Compression improvement vs. float', np.interp(px, precision, compression)/2.0


Lossless compression 1.03764578923
Precision improvement vs. float 39.9168379234
Compression improvement vs. float 1.11205527366

In [47]:
#- comparing compression of double vs. float
qx, compression, precision = plotscan(dd, 'GZIP_1')
qx, compression, precision = plotscan(df, 'GZIP_1', extra_noise=df_precision, pre_compression=2.0)