Python provides file reading and writing, and object serialisation / reconstruction (the python pickle module). numpy provides methods for storing and retrieving structured arrays quickly and efficiently (including data compression). scipy provides some helper functions for common file formats such as netcdf and matlab.
Sometimes data are hard won, and reformatting them into easily retrieved files can be a lifesaver.
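As a minimal sketch of the serialisation / reconstruction idea, the following round-trips a small Python object through the pickle module. The record contents and the file name are made up for illustration, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical record standing in for some hard-won result
record = {"station": "ABC1", "depth_km": [0.0, 10.0, 35.0], "ok": True}

path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Serialise the object to a binary file ...
with open(path, "wb") as f:
    pickle.dump(record, f)

# ... and reconstruct it later
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)
```

Pickle files are python-specific and should only be loaded from sources you trust, since unpickling can execute arbitrary code.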
In [15]:
import numpy as np
from scipy.io import netcdf
Text-based formats are completely portable and human readable, and they are common outputs from simple scripted programs, web searches, program logs and so on. Reading and writing them is trivial, and it is easy to append information to a file. The only problem is that converting between text and numbers can be extremely slow.
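To illustrate how little is needed to write, append to and read a text file, here is a small sketch using numpy. The file name and the values are invented for the example, and NaN is used for "no data" in the same spirit as the age data below.

```python
import os
import tempfile
import numpy as np

txt_path = os.path.join(tempfile.mkdtemp(), "points.xyz")

# Write two lon / lat / value rows as plain text
first = np.array([[0.0, 10.0, 1.5], [1.0, 10.0, 2.5]])
np.savetxt(txt_path, first, fmt="%.3f")

# Appending is just opening the file in append mode and writing more rows
with open(txt_path, "a") as f:
    np.savetxt(f, np.array([[2.0, 10.0, np.nan]]), fmt="%.3f")

# Every number has to be parsed back from characters when the file is read
points = np.loadtxt(txt_path)
print(points.shape)
```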
This example is taken from our mapping exercise and shows the benefit of converting data to binary formats.
# Seafloor age data and global image - data from Earthbyters
# The data come as ascii lon / lat / age tuples with NaN for no data.
# This can be loaded with ...
age = np.loadtxt("global_age_data.3.6.xyz")
age_data = age.reshape(1801,3601,3) # I looked at the data and figured out what numbers to use
age_img = age_data[:,:,2]
# But this is super slow, so I have just stored the Age data on the grid (1801 x 3601) which we can reconstruct easily
datasize = (1801, 3601, 3)
age_data = np.empty(datasize)
ages = np.load("global_age_data.3.6.z.npz")["ageData"]
lats = np.linspace(90, -90, datasize[0])
lons = np.linspace(-180.0,180.0, datasize[1])
arrlons,arrlats = np.meshgrid(lons, lats)
age_data[...,0] = arrlons[...]
age_data[...,1] = arrlats[...]
age_data[...,2] = ages[...]
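The compressed file itself can be produced with np.savez_compressed. The sketch below shows the idea on a small synthetic grid rather than the real 1801 x 3601 data; the grid size, the made-up values and the output file name are assumptions, and only the "ageData" key mirrors the name used above.

```python
import os
import tempfile
import numpy as np

# Small synthetic grid standing in for the 1801 x 3601 age grid
nlat, nlon = 19, 37
lats = np.linspace(90, -90, nlat)
lons = np.linspace(-180.0, 180.0, nlon)
arrlons, arrlats = np.meshgrid(lons, lats)

# Made-up values in place of real ages, with NaN for "no data"
demo_ages = np.hypot(arrlons, arrlats)
demo_ages[0, :] = np.nan

# Store only the values; the regular lon / lat grid can be rebuilt on load
npz_path = os.path.join(tempfile.mkdtemp(), "demo_age.npz")
np.savez_compressed(npz_path, ageData=demo_ages)

# np.load returns a dictionary-like archive keyed by the names used when saving
with np.load(npz_path) as archive:
    restored = archive["ageData"]

print(restored.shape)
print(np.allclose(restored, demo_ages, equal_nan=True))
```

Storing only the third column works here because the longitudes and latitudes lie on a regular grid; irregularly spaced points would need their coordinates saved as well.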
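The timing comparison is astonishing.
On my laptop the numpy binary file is about a million times faster to read. I cut out the lat/lon values from this file to save some space, but keeping them would add, at most, a factor of 3 to the npz timing. Bear in mind that np.load on an .npz archive is lazy: an array is only read and decompressed when it is requested by name (as with ["ageData"] above), so timing np.load on its own mostly measures opening the archive.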
In [56]:
%%timeit
gadtxt = np.loadtxt("../../Data/Reference/global_age_data.3.6.xyz")
In [54]:
%%timeit
gadnpy = np.load("../../Data/Reference/global_age_data.3.6.z.npz")
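Outside the notebook, a similar comparison can be sketched with the timeit module. This version uses a small synthetic table in a temporary directory (the sizes and file names are assumptions, not the real data set) and indexes the npz archive so that the array is actually read. The ratio it prints will depend on the machine and on the size of the data.

```python
import os
import tempfile
import timeit
import numpy as np

# Synthetic lon / lat / value table, much smaller than the real data set
rng = np.random.default_rng(0)
table = rng.random((20000, 3))

tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "table.xyz")
npz_path = os.path.join(tmpdir, "table.npz")
np.savetxt(txt_path, table)
np.savez_compressed(npz_path, table=table)

def read_text():
    return np.loadtxt(txt_path)

def read_npz():
    # Index the archive so that the array is really read and decompressed
    with np.load(npz_path) as archive:
        return archive["table"]

# Best of three single reads for each format
t_text = min(timeit.repeat(read_text, number=1, repeat=3))
t_npz = min(timeit.repeat(read_npz, number=1, repeat=3))
print("text: %.4f s, npz: %.4f s, ratio: %.0f" % (t_text, t_npz, t_text / t_npz))
```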
In [45]:
from pprint import pprint # pretty printer for python objects
# Open the netcdf file and list its dimensions and variables
nf = netcdf.netcdf_file(filename="../../Data/Reference/velocity_AU.nc")
pprint( nf.dimensions )
pprint( nf.variables )
# Shapes of the coordinate arrays and of the two velocity components
print(nf.variables["lat"].data.shape)
print(nf.variables["lon"].data.shape)
print(nf.variables['ve'].data.shape)
print(nf.variables['vn'].data.shape)
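For readers without the velocity file, the following sketch writes a tiny netcdf file and reads it back with scipy.io.netcdf_file (the same class as netcdf.netcdf_file above, imported directly from scipy.io). The file name, the grid and the values are invented; only the variable names lat, lon and ve echo the ones printed above.

```python
import os
import tempfile
import numpy as np
from scipy.io import netcdf_file

nc_path = os.path.join(tempfile.mkdtemp(), "demo_velocity.nc")

# Write a small file with hypothetical lat / lon axes and one 2D field
out = netcdf_file(nc_path, "w")
out.createDimension("lat", 3)
out.createDimension("lon", 4)
lat = out.createVariable("lat", "d", ("lat",))
lon = out.createVariable("lon", "d", ("lon",))
ve = out.createVariable("ve", "d", ("lat", "lon"))
lat[:] = np.linspace(-10.0, -40.0, 3)
lon[:] = np.linspace(110.0, 155.0, 4)
ve[:] = np.arange(12.0).reshape(3, 4)
out.close()

# Read it back; mmap=False plus .copy() lets the arrays outlive the file
nf_demo = netcdf_file(nc_path, "r", mmap=False)
dims = dict(nf_demo.dimensions)
ve_back = nf_demo.variables["ve"][:].copy()
nf_demo.close()

print(dims)
print(ve_back.shape)
```

Copying the data before closing is a cautious choice: with memory mapping enabled, arrays taken from the file can still refer to it after it has been closed.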