In [ ]:
#!git clone https://github.com/met-office-lab/asn_data_utils.git
from asn_data_utils.asn_utils.Loader import Loader

# List the available MOGREPS data files via the helper library
l = Loader()
fs = l.list_files("mogreps")
Now we can make two lists of file paths.
This first one gives us the paths as they appear on the same node as this Notebook, i.e. for local loading. We're just doing this for one day.
In [ ]:
local_fs_for_20161002T0000Z = [f for f in fs if '20161002T0000Z' in f and '_000_' not in f]
And this second one gives us the same files, but with their paths as seen on the Dask nodes. All files can be accessed from any node.
In [ ]:
node_fs_for_20161002T0000Z = [f.replace('/usr/local/share/notebooks/', '/') for f in local_fs_for_20161002T0000Z]
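As a quick check (just printing the first entry of each list, assuming the lists are non-empty), we can see how a notebook-local path maps onto its node-side equivalent.
In [ ]:
# Compare a notebook-local path with the same file's path as seen from the Dask nodes
print(local_fs_for_20161002T0000Z[0])
print(node_fs_for_20161002T0000Z[0])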
We're going to use Iris, the Met Office's Python data analysis module. Here it's going to load a bunch of the data files locally into separate data objects, called cubes.
In [ ]:
import iris
# Load precipitation data from the first 30 local files into a list of raw cubes
ds = iris.load_raw(local_fs_for_20161002T0000Z[:30], "precipitation_amount")
In [ ]:
print(ds)
Each cube is a NumPy array (ds[0].data) with associated metadata. Using the magic of Iris we can turn these into one cube.
In [ ]:
# Merge the separate per-file cubes into a single cube
d = ds.merge_cube()
print(d)
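As a quick sanity check (a minimal sketch, assuming the merge succeeded and produced a single cube named d), we can look at the merged cube's underlying array and its coordinate metadata.
In [ ]:
# The cube's payload is a NumPy array, and its coordinates carry the metadata
print(d.shape)
print(type(d.data))
print([coord.name() for coord in d.coords()])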
So this now looks like one cube, but it's actually made up of lots of separate cubes from separate files...
...so can we put these files on different nodes?
The challenge is to see if you can do distributed processing on this data. We've got another 1 EB of it, so it would be rather lovely if we could crack this!
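One possible starting point (a rough sketch, not a tested solution) is to use dask.distributed to farm the per-file loading out to the worker nodes, using the node-side paths we built earlier. The scheduler address below is a placeholder, mean_precip is our own made-up helper, and this assumes Iris and the data paths are available on every worker.
In [ ]:
# Sketch only: distribute the per-file loading across the Dask workers.
# "scheduler-address:8786" is a placeholder; replace it with the real scheduler.
from dask.distributed import Client
import iris

def mean_precip(path):
    # Runs on a worker node: load one file and reduce it to a single number
    cube = iris.load_cube(path, "precipitation_amount")
    return float(cube.data.mean())

client = Client("scheduler-address:8786")
futures = client.map(mean_precip, node_fs_for_20161002T0000Z)
means = client.gather(futures)
print(means[:5])
Whether the real win comes from mapping over whole files like this, from chunking the arrays with dask.array, or from something else entirely is exactly the open question here.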