Big data processing

Here we have a look at how one can work with data that do not fit in to the memory. We are going to use xarray with dask support for this.

Imports:



In [1]:

    
import sys
sys.path.append("../")

from netCDF4 import Dataset, MFDataset
import pyfesom as pf
import numpy as np
from mpl_toolkits.basemap import Basemap
import matplotlib.pylab as plt
import numpy as np
%matplotlib inline
from matplotlib import cm

import xarray as xr
import pandas as pd









    



/mnt/lustre01/work/ab0995/a270088/miniconda2/envs/py35/lib/python3.5/site-packages/cmocean/tools.py:76: MatplotlibDeprecationWarning: The is_string_like function was deprecated in version 2.1.
  if not mpl.cbook.is_string_like(rgbin[0]):

Loading mesh. This mesh is rotated, so we use default values. If your mesh is rotated don't forget to use abg parameter.



In [2]:

    
meshpath  ='/mnt/lustre01/work/ab0995/a270088/data/core_mesh/'
mesh = pf.load_mesh(meshpath, usepickle=False, usejoblib=True)









    



The usejoblib == True)
The joblib file for python 3 exists.
The mesh will be loaded from /mnt/lustre01/work/ab0995/a270088/data/core_mesh/joblib_mesh

Open multiple files at once. Please have a look at this page to understand what chinks are for.



In [7]:

    
data = xr.open_mfdataset('/work/ab0995/a270067/fesom_echam/core/cpl_output_02/fesom.200?.oce.mean.nc',
                         chunks={'time': 12})

Now you have a Dataset that have all the data in.



In [11]:

    
data









    Out[11]:





<xarray.Dataset>
Dimensions:  (nodes_2d: 126859, nodes_3d: 3668773, time: 120)
Coordinates:
  * time     (time) datetime64[ns] 2000-02-01 2000-03-01 2000-04-01 ...
Dimensions without coordinates: nodes_2d, nodes_3d
Data variables:
    iter     (time) int32 dask.array<shape=(120,), chunksize=(12,)>
    ssh      (time, nodes_2d) float32 dask.array<shape=(120, 126859), chunksize=(12, 126859)>
    u        (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    v        (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    w        (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    temp     (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    salt     (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>

You can see that in this version of fesom output there is a bug with shifter time stemps (times starts from '2000-02-01'). We are giong to fix it. Create time stamps with pandas:



In [13]:

    
dates = pd.date_range('2000','2010', freq='M', )
dates









    Out[13]:





DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-30', '2000-10-31',
               ...
               '2009-03-31', '2009-04-30', '2009-05-31', '2009-06-30',
               '2009-07-31', '2009-08-31', '2009-09-30', '2009-10-31',
               '2009-11-30', '2009-12-31'],
              dtype='datetime64[ns]', length=120, freq='M')

Note, that you have to put one year more in this case since the right boundary is not included. Now replace time stamps in the data by the right ones:



In [14]:

    
data.time.data = dates



In [15]:

    
data









    Out[15]:





<xarray.Dataset>
Dimensions:  (nodes_2d: 126859, nodes_3d: 3668773, time: 120)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 ...
Dimensions without coordinates: nodes_2d, nodes_3d
Data variables:
    iter     (time) int32 dask.array<shape=(120,), chunksize=(12,)>
    ssh      (time, nodes_2d) float32 dask.array<shape=(120, 126859), chunksize=(12, 126859)>
    u        (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    v        (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    w        (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    temp     (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>
    salt     (time, nodes_3d) float32 dask.array<shape=(120, 3668773), chunksize=(12, 3668773)>

Good, we now have right time stamps and can work with time.

Time mean

The time mean over the whole time period is simple:



In [16]:

    
temp_mean = data.temp.mean(dim='time')

Here we select temp variable and apply mean to it. You also have to specify the dimention (dim) that you want to make a mean over. You probably noticed that "computation" was performed very quickly. This is because there were now computation at all, just preparation for it. To actually do computation do:



In [17]:

    
temp_mean = temp_mean.compute()



In [18]:

    
temp_mean









    Out[18]:





<xarray.DataArray 'temp' (nodes_3d: 3668773)>
array([-1.484545, -0.529427, -1.610458, ...,  0.578853,  0.434934,  0.435094],
      dtype=float32)
Dimensions without coordinates: nodes_3d

Mean over time slice

One can use slices to select data over some time period:



In [21]:

    
data.temp.sel(time=slice('2000-01-01', '2003-12-31')).time









    Out[21]:





<xarray.DataArray 'time' (time: 48)>
array(['2000-01-31T00:00:00.000000000', '2000-02-29T00:00:00.000000000',
       '2000-03-31T00:00:00.000000000', '2000-04-30T00:00:00.000000000',
       '2000-05-31T00:00:00.000000000', '2000-06-30T00:00:00.000000000',
       '2000-07-31T00:00:00.000000000', '2000-08-31T00:00:00.000000000',
       '2000-09-30T00:00:00.000000000', '2000-10-31T00:00:00.000000000',
       '2000-11-30T00:00:00.000000000', '2000-12-31T00:00:00.000000000',
       '2001-01-31T00:00:00.000000000', '2001-02-28T00:00:00.000000000',
       '2001-03-31T00:00:00.000000000', '2001-04-30T00:00:00.000000000',
       '2001-05-31T00:00:00.000000000', '2001-06-30T00:00:00.000000000',
       '2001-07-31T00:00:00.000000000', '2001-08-31T00:00:00.000000000',
       '2001-09-30T00:00:00.000000000', '2001-10-31T00:00:00.000000000',
       '2001-11-30T00:00:00.000000000', '2001-12-31T00:00:00.000000000',
       '2002-01-31T00:00:00.000000000', '2002-02-28T00:00:00.000000000',
       '2002-03-31T00:00:00.000000000', '2002-04-30T00:00:00.000000000',
       '2002-05-31T00:00:00.000000000', '2002-06-30T00:00:00.000000000',
       '2002-07-31T00:00:00.000000000', '2002-08-31T00:00:00.000000000',
       '2002-09-30T00:00:00.000000000', '2002-10-31T00:00:00.000000000',
       '2002-11-30T00:00:00.000000000', '2002-12-31T00:00:00.000000000',
       '2003-01-31T00:00:00.000000000', '2003-02-28T00:00:00.000000000',
       '2003-03-31T00:00:00.000000000', '2003-04-30T00:00:00.000000000',
       '2003-05-31T00:00:00.000000000', '2003-06-30T00:00:00.000000000',
       '2003-07-31T00:00:00.000000000', '2003-08-31T00:00:00.000000000',
       '2003-09-30T00:00:00.000000000', '2003-10-31T00:00:00.000000000',
       '2003-11-30T00:00:00.000000000', '2003-12-31T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 ...
Attributes:
    long_name:  time

Mean over this slice will look like this:



In [22]:

    
temp_mean_3years = data.temp.sel(time=slice('2000-01-01', '2003-12-31')).mean(dim='time')
temp_mean_3years = temp_mean_3years.compute()

Mean over specific month

!!! PLESE READ THIS TO GET MORE INFORMATION !!!

Our data are monthly:



In [26]:

    
data.time[:14]









    Out[26]:





<xarray.DataArray 'time' (time: 14)>
array(['2000-01-31T00:00:00.000000000', '2000-02-29T00:00:00.000000000',
       '2000-03-31T00:00:00.000000000', '2000-04-30T00:00:00.000000000',
       '2000-05-31T00:00:00.000000000', '2000-06-30T00:00:00.000000000',
       '2000-07-31T00:00:00.000000000', '2000-08-31T00:00:00.000000000',
       '2000-09-30T00:00:00.000000000', '2000-10-31T00:00:00.000000000',
       '2000-11-30T00:00:00.000000000', '2000-12-31T00:00:00.000000000',
       '2001-01-31T00:00:00.000000000', '2001-02-28T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 ...
Attributes:
    long_name:  time

The sel allows to provide explicit time steps, so if we just select only March values:



In [27]:

    
data.time[2::12]









    Out[27]:





<xarray.DataArray 'time' (time: 10)>
array(['2000-03-31T00:00:00.000000000', '2001-03-31T00:00:00.000000000',
       '2002-03-31T00:00:00.000000000', '2003-03-31T00:00:00.000000000',
       '2004-03-31T00:00:00.000000000', '2005-03-31T00:00:00.000000000',
       '2006-03-31T00:00:00.000000000', '2007-03-31T00:00:00.000000000',
       '2008-03-31T00:00:00.000000000', '2009-03-31T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-03-31 2001-03-31 2002-03-31 ...
Attributes:
    long_name:  time

we can provide this values directly to sel. We also make a mean over the selected time and do the computation:



In [29]:

    
temp_march_mean = data.temp.sel(time=data.time[2::12]).mean(dim='time')
temp_march_mean = temp_march_mean.compute()

The xarray have more explicit syntax to select months (returns array that show which record in your array corespond to each month):



In [30]:

    
data['time.month']









    Out[30]:





<xarray.DataArray 'month' (time: 120)>
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 ...

Using this sysntax you can easily select March:



In [32]:

    
data.temp[data['time.month']==3]









    Out[32]:





<xarray.DataArray 'temp' (time: 10, nodes_3d: 3668773)>
dask.array<shape=(10, 3668773), dtype=float32, chunksize=(1, 3668773)>
Coordinates:
  * time     (time) datetime64[ns] 2000-03-31 2001-03-31 2002-03-31 ...
Dimensions without coordinates: nodes_3d
Attributes:
    description:  mean potential temperature
    units:        degC

And make a mean over this month:



In [33]:

    
temp_march_mean = data.temp[data['time.month']==3].mean(dim='time')
temp_march_mean = temp_march_mean.compute()

Please have a look at this page to see what "datetime components" are supported. At the time of this writing the list contains: “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”. Additional xarray component is season. You can select winter temperature values and average over them by:



In [35]:

    
temp_DJF_mean = data.temp[data['time.season']=='DJF'].mean(dim='time')
temp_DJF_mean = temp_DJF_mean.compute()

Resampling

Once again, please read this page to get more information. If we would like to resample our data, making yearly means, the way to do it is:



In [40]:

    
yearly_data = data.resample(time='1A').mean(dim='time')



In [37]:

    
yearly_data









    Out[37]:





<xarray.Dataset>
Dimensions:  (nodes_2d: 126859, nodes_3d: 3668773, time: 10)
Coordinates:
  * time     (time) datetime64[ns] 2000-12-31 2001-12-31 2002-12-31 ...
Dimensions without coordinates: nodes_2d, nodes_3d
Data variables:
    iter     (time) float64 dask.array<shape=(10,), chunksize=(1,)>
    ssh      (time, nodes_2d) float32 dask.array<shape=(10, 126859), chunksize=(1, 126859)>
    u        (time, nodes_3d) float32 dask.array<shape=(10, 3668773), chunksize=(1, 3668773)>
    v        (time, nodes_3d) float32 dask.array<shape=(10, 3668773), chunksize=(1, 3668773)>
    w        (time, nodes_3d) float32 dask.array<shape=(10, 3668773), chunksize=(1, 3668773)>
    temp     (time, nodes_3d) float32 dask.array<shape=(10, 3668773), chunksize=(1, 3668773)>
    salt     (time, nodes_3d) float32 dask.array<shape=(10, 3668773), chunksize=(1, 3668773)>



In [38]:

    
yearly_data = yearly_data.compute()

Complete list of frequencies can be found here. Most important for us are:

A - year
M - month
D - day
H - hour



In [ ]: