Reading complex .mat files.

This notebook shows an example of reading a Matlab .mat file, converting the data into a usable dictionary with loops, a simple plot of the data.

Reading a .mat file

First, we open our .mat file using loadmat from scipy.io. This function returns the variables in the .mat file as keys in a dictionary. Note that this function can only open .mat files less than version 7.3. If the version of your file is 7.3 or above, you have two options:

  1. Open the file in Matlab and save as a lower version:

     save filename.mat -v7
  2. After v. 7.3, .mat files are saved as HDF5 files. This format is a super-set of netcdf (version 4) and can be read using the h5py package.

The example file we are reading has a complex data structure. Originally the file contained many additional variables, corresponding to other mooring lines, but we have reduced it to only contain data for a single set of moorings. The same techniques could be used with a loop to read the entire file.

Python has two data types similar to Matlab's struct: dict from standard Python and record arrays or stuctured arrays from numpy. I usually use dict as it is in the basic libraries, lots of information about them can be found online, and they're relatively simple to access. numpy also provides record arrays which are more similar to Matlab's struct arrays in that they can be any shape. Because of this, they are more powerful than dict but also a bit more confusing to access and use.

To start, we'll read in the file and print the keys of the dictionary the file was saved to. These keys correspond to the variable stored in the original .mat file.


In [1]:
%matplotlib inline
import numpy as np
import scipy.io as spio
import matplotlib.pyplot as plt

matfile = '../../data/CA2009.mat'
matdata = spio.loadmat(matfile)
print matdata.keys()


['T_CA042_090603_035', 'T_CA042_090603_030', 'T_CA065_090603_050', 'T_CA015_090429_015', 'T_CA015_090429_010', 'T_CA065_090603_010', 'T_CA042_090728_010', 'T_CA042_090429_010', 'CTPO_CA015_090429_001', 'F_CA042_090429_038', 'T_CA065_090429_040', 'T_CA065_090910_030', 'T_CA065_090728_040', 'T_CA042_090603_020', 'T_CA042_090728_001', 'T_CA065_090603_020', 'T_CA015_090429_001', 'T_CA042_090429_001', 'T_CA065_090429_020', 'T_CA015_090603_015', 'T_CA042_090902_001', 'T_CA065_090429_050', 'CTPO_CA015_090728_001', 'T_CA065_090910_020', 'T_CA065_090728_050', 'F_CA042_090728_038', 'CT_CA042_090902_038', 'T_CA015_090902_015', 'T_CA015_090902_005', '__header__', 'T_CA042_090728_030', 'T_CA042_090728_035', 'T_CA065_090603_030', 'T_CA015_090603_005', 'T_CA065_090429_030', 'T_CA015_090603_001', 'CT_CA042_090728_038', 'T_CA065_090910_050', 'T_CA065_090728_060', 'T_CA065_090603_040', 'T_CA015_090902_001', 'CTPO_CA042_090603_001', 'T_CA042_090728_020', 'T_CA065_090429_001', 'CTPO_CA015_090603_001', 'CTPO_CA015_090902_001', 'T_CA065_090910_040', 'T_CA015_090902_010', 'T_CA065_090728_001', 'CTPO_CA042_090728_001', 'CTPO_CA065_090603_001', 'CTPO_CA065_090728_001', 'CTPO_CA042_090429_001', 'CTPO_CA065_090429_001', 'CTPO_CA042_090902_001', 'T_CA065_090603_065', 'T_CA042_090728_042', 'T_CA065_090603_060', 'T_CA065_090728_010', 'T_CA042_090429_042', 'CT_CA042_090603_038', 'T_CA042_090902_030', 'T_CA042_090902_035', 'T_CA065_090429_010', 'T_CA042_090902_042', 'F_CA042_090902_038', 'T_CA065_090910_065', 'T_CA065_090910_060', 'T_CA042_090603_010', 'T_CA065_090728_020', 'T_CA042_090429_035', 'T_CA042_090429_030', 'T_CA042_090902_020', 'T_CA065_090429_065', 'F_CA042_090603_038', 'T_CA065_090910_010', 'CT_CA042_090429_038', 'T_CA015_090728_010', '__version__', 'T_CA015_090728_015', 'T_CA042_090603_001', 'T_CA042_090429_020', 'T_CA065_090603_001', 'T_CA042_090902_010', 'T_CA065_090429_060', 'T_CA065_090910_001', '__globals__', 'CTPO_CA065_090902_001', 'T_CA015_090728_001', 'T_CA015_090728_005']

For this data set, each variable is composed of four parts, with each part separated by underscores: INST_STA_DATE_DEPTH

  • INST: The instrument(s) located at this part of the mooring: T is temperature; CTPO is conductivity, temperature, pressure, and oxygen; etc..
  • STA: The station location: mooring line and water depth. CA_042 is Cape Alava at 42 m depth.
  • DATE: Date the mooring instrument were recovered.
  • DEPTH: Nominal depth of the instruments along the the mooring.

Also, note that there are keys like __globals__, __version__ and __header__. These keys contain information about the .mat file, but, in general, are not used. We'll have to come up with a way to ignore them later.

Pretty-printing the data in the file

To get a better look at the data, we're going to write a small function that recursively goes into every key of a dictionary or every field of a structured array and prints the data, indenting so that we know what level we're at.


In [2]:
def print_mat_nested(d, indent=0, nkeys=0):
    """Pretty print nested structures from .mat files   
    Inspired by: `StackOverflow <http://stackoverflow.com/questions/3229419/pretty-printing-nested-dictionaries-in-python>`_
    """
    
    # Subset dictionary to limit keys to print.  Only works on first level
    if nkeys>0:
        d = {k: d[k] for k in d.keys()[:nkeys]}  # Dictionary comprehension: limit to first nkeys keys.

    if isinstance(d, dict):
        for key, value in d.iteritems():         # iteritems loops through key, value pairs
          print '\t' * indent + 'Key: ' + str(key)
          print_mat_nested(value, indent+1)

    if isinstance(d,np.ndarray) and d.dtype.names is not None:  # Note: and short-circuits by default
        for n in d.dtype.names:    # This means it's a struct, it's bit of a kludge test.
            print '\t' * indent + 'Field: ' + str(n)
            print_mat_nested(d[n], indent+1)

print_mat_nested(matdata, nkeys=10)


Key: T_CA042_090603_035
	Field: data
	Field: lpdata
Key: T_CA042_090603_030
	Field: data
	Field: lpdata
Key: T_CA065_090603_050
	Field: data
	Field: lpdata
Key: T_CA015_090429_015
	Field: data
	Field: lpdata
Key: T_CA015_090429_010
	Field: data
	Field: lpdata
Key: T_CA065_090603_010
	Field: data
	Field: lpdata
Key: T_CA042_090728_010
	Field: data
	Field: lpdata
Key: T_CA042_090429_010
	Field: data
	Field: lpdata
Key: CTPO_CA015_090429_001
	Field: data
	Field: lpdata
Key: F_CA042_090429_038
	Field: data
	Field: lpdata

OK, so each key has two fields: data and lpdata. That seems reasonable enough. The original file had the same format with each variable being a struct with these two fields. In addition, each of these fields was a nested struct containing fields like time, temp, etc.

Importing nested structured arrays from .mat file

It's a bit odd that our function didn't drill down further. It should have kept going through each level of nesting if the data at each level was a structured array or a dict. Instead, it stopped before getting to the time or temp fields.

Let's take a look at the fields of those structure arrays.


In [3]:
t = matdata['T_CA065_090603_010']   # Just look at one set of data

# Here we define a simple function to print the data type and shape.  It just saves us some repetition.
def dtype_shape_str(x):
    """ Return string containing the dtype and shape of x."""
    return str(x.dtype) + " " + str(x.shape)

print "t['data'] is a: " + dtype_shape_str(t['data'])
print "t['data'][0,0] is a: " + dtype_shape_str(t['data'][0,0])
print "t['data'][0,0]['time'] is a: " + dtype_shape_str(t['data'][0,0]['time'])
print "t['data'][0,0]['time'][0,0] is a: " + dtype_shape_str(t['data'][0,0]['time'][0,0])


t['data'] is a: object (1, 1)
t['data'][0,0] is a: [('time', 'O'), ('temp', 'O')] (1, 1)
t['data'][0,0]['time'] is a: object (1, 1)
t['data'][0,0]['time'][0,0] is a: float64 (32344, 1)

Turns out the data field of our variable is of type object. That's a bit odd... Looking closer, it is 2D with shape (1,1) so it's essentially a scalar. The actual data is nested even deeper inside the time field which is also a (1,1) structure array.

Let's just do a quick plot of the data to see what we have:


In [4]:
time = t['data'][0,0]['time'][0,0]
temp = t['data'][0,0]['temp'][0,0]
plt.plot(time,temp,'ko')
plt.xlabel('time')
plt.ylabel('temp')


Out[4]:
<matplotlib.text.Text at 0x114966390>

Great! That looks like temperature data. But why do we need all the extra indexing? This is the difference between arrays of structs and structs of arrays. A similar situation happens in Matlab, but if your struct is scalar, you don't need the extra [0,0] indices. (NB: For the scopy.io.loadmat function a squeeze_me argument exists, which, in theory, should reduce these sub-components to scalars. It seemed to work- I got scalars- but for some reason I couldn't access the results)

While arrays of structs can be powerful, it's often the case that datasets will just have nested scalar structures where the actual vector or array data is located at the base of the tree. In these cases, the extra [0,0] indexing is annoying and confusing. Fortunately, someone has already put together a function that breaks up this complex nest of data structures into something manageable. The function below loads a .mat file and converts the nested scalar structures into nested dictionaries. If you aren't using arrays of structs, then this version should work just fine for you.


In [5]:
def loadmat(filename):
    '''
    this function should be called instead of direct spio.loadmat
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function check keys to cure all entries
    which are still mat-objects
    
    from: `StackOverflow <http://stackoverflow.com/questions/7008608/scipy-io-loadmat-nested-structures-i-e-dictionaries>`_
    '''
    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)

def _check_keys(dict):
    '''
    checks if entries in dictionary are mat-objects. If yes
    todict is called to change them to nested dictionaries
    '''
    for key in dict:
        if isinstance(dict[key], spio.matlab.mio5_params.mat_struct):
            dict[key] = _todict(dict[key])
    return dict        

def _todict(matobj):
    '''
    A recursive function which constructs from matobjects nested dictionaries
    '''
    dict = {}
    for strg in matobj._fieldnames:
        elem = matobj.__dict__[strg]
        if isinstance(elem, spio.matlab.mio5_params.mat_struct):
            dict[strg] = _todict(elem)
        else:
            dict[strg] = elem
    return dict

matdata = loadmat(matfile)

print_mat_nested(matdata, nkeys=6)


Key: T_CA042_090603_035
	Key: lpdata
		Key: temp
	Key: data
		Key: temp
		Key: time
Key: T_CA042_090603_030
	Key: lpdata
		Key: temp
	Key: data
		Key: temp
		Key: time
Key: T_CA065_090603_050
	Key: lpdata
		Key: temp
	Key: data
		Key: temp
		Key: time
Key: T_CA015_090429_015
	Key: lpdata
		Key: temp
	Key: data
		Key: temp
		Key: time
Key: T_CA015_090429_010
	Key: lpdata
		Key: temp
	Key: data
		Key: temp
		Key: time
Key: T_CA065_090603_010
	Key: lpdata
		Key: temp
	Key: data
		Key: temp
		Key: time

Now that's more like it!

Rearranging the data hierarchy

The next step is to combine that data into useful structures. In the original .mat file, the variable names (now keys in matdata) correspond to data from a particular instrument on a particular mooring, as explained above. We can regroup this data using python dictionaries into a form that might be a bit more useful.

Looking at the output above, there are two keys for each variable: lpdata and data. lpdata contains low-pass filtered data, but uses the same time array as data. Let's combine these into one dictionary to reduce the level a nesting a bit.

The function below does this in-place, by altering the dictionary itself. Because it's not actually creating new data, it works quickly. Note that before we do it, we copy all the original data by using the Python function deepcopy. This function copies everything. The normal copy function only copies things at the top-level.


In [6]:
# Now that the data has a hierarchy, we can start combining things.

def combine_lpdata(d):
    """Combine the low-pass and regular data to a single dictionary.  
    The input is the dictionary that contains the 'data' and 'lpdata' keys.
    
    The function works in-place, but returns a reference to the dictionary as well."""
    
    # For regular data, just pop it off the dictionary, keeping the same name.
    # The `pop` method returns the value of key `k` and removes it from the dictionary at the same time.
    for k in d['data'].keys():
        d[k] = d['data'].pop(k)
    d.pop('data')

    # For low-passed data, prepend 'lp' to the front of the key.
    for k in d['lpdata'].keys():
        d['lp'+k] = d['lpdata'].pop(k)
    d.pop('lpdata')

    return d

# Test the function
from copy import deepcopy
test = deepcopy(matdata['T_CA042_090603_010'])  # Copy the data for the test.
print 'keys before:', test.keys()
combine_lpdata(test)
print 'keys after:', test.keys()  # Note how we print the same thing, but the keys have changed.


# Now we'll do it on all of matdata
# The next line ignores the special variables starting with '__'
# You can't have matlab variables starting with '__' so this should only remove extra keys created by loadmat.
variables = [k for k in matdata.keys() if not k.startswith('__')]   # This is a list comprehension. 

# Records is now just a list, where each element is a string corresponding to a key of the dictionary matdata.
print variables

# We can loop through all of it, running our function
for v in variables:
    combine_lpdata(matdata[v])

# Or we can loop through just part of it to print the results
for v in variables[:4]:
    print matdata[v]


keys before: ['lpdata', 'data']
keys after: ['lptemp', 'temp', 'time']
['T_CA042_090603_035', 'T_CA042_090603_030', 'T_CA065_090603_050', 'T_CA015_090429_015', 'T_CA015_090429_010', 'T_CA065_090603_010', 'T_CA042_090728_010', 'T_CA042_090429_010', 'CTPO_CA015_090429_001', 'F_CA042_090429_038', 'T_CA065_090429_040', 'T_CA065_090910_030', 'T_CA065_090728_040', 'T_CA042_090603_020', 'T_CA042_090728_001', 'T_CA065_090603_020', 'T_CA015_090429_001', 'T_CA042_090429_001', 'T_CA065_090429_020', 'T_CA015_090603_015', 'T_CA042_090902_001', 'T_CA065_090429_050', 'CTPO_CA015_090728_001', 'T_CA065_090910_020', 'T_CA065_090728_050', 'F_CA042_090728_038', 'CT_CA042_090902_038', 'T_CA015_090902_015', 'T_CA015_090902_005', 'T_CA042_090728_030', 'T_CA042_090728_035', 'T_CA065_090603_030', 'T_CA015_090603_005', 'T_CA065_090429_030', 'T_CA015_090603_001', 'CT_CA042_090728_038', 'T_CA065_090910_050', 'T_CA065_090728_060', 'T_CA065_090603_040', 'T_CA015_090902_001', 'CTPO_CA042_090603_001', 'T_CA042_090728_020', 'T_CA065_090429_001', 'CTPO_CA015_090603_001', 'CTPO_CA015_090902_001', 'T_CA065_090910_040', 'T_CA015_090902_010', 'T_CA065_090728_001', 'CTPO_CA042_090728_001', 'CTPO_CA065_090603_001', 'CTPO_CA065_090728_001', 'CTPO_CA042_090429_001', 'CTPO_CA065_090429_001', 'CTPO_CA042_090902_001', 'T_CA065_090603_065', 'T_CA042_090728_042', 'T_CA065_090603_060', 'T_CA065_090728_010', 'T_CA042_090429_042', 'CT_CA042_090603_038', 'T_CA042_090902_030', 'T_CA042_090902_035', 'T_CA065_090429_010', 'T_CA042_090902_042', 'F_CA042_090902_038', 'T_CA065_090910_065', 'T_CA065_090910_060', 'T_CA042_090603_010', 'T_CA065_090728_020', 'T_CA042_090429_035', 'T_CA042_090429_030', 'T_CA042_090902_020', 'T_CA065_090429_065', 'F_CA042_090603_038', 'T_CA065_090910_010', 'CT_CA042_090429_038', 'T_CA015_090728_010', 'T_CA015_090728_015', 'T_CA042_090603_001', 'T_CA042_090429_020', 'T_CA065_090603_001', 'T_CA042_090902_010', 'T_CA065_090429_060', 'T_CA065_090910_001', 'CTPO_CA065_090902_001', 'T_CA015_090728_001', 'T_CA015_090728_005']
{'lptemp': array([ 8.7598,  8.759 ,  8.7583, ...,  9.1374,  9.138 ,  9.1387]), 'temp': array([ 8.87,  8.87,  8.87, ...,  9.02,  9.02,  9.02]), 'time': array([ 733927.58319444,  733927.58458333,  733927.58597222, ...,
        733972.54847222,  733972.54986111,  733972.55125   ])}
{'lptemp': array([ 8.2157,  8.2152,  8.2146, ...,  7.5484,  7.5476,  7.5469]), 'temp': array([ 8.31,  8.31,  8.31, ...,  7.38,  7.22,  7.38]), 'time': array([ 733927.58      ,  733927.58138889,  733927.58277778, ...,
        733972.54805556,  733972.54944444,  733972.55083333])}
{'lptemp': array([ 8.6618,  8.6614,  8.661 , ...,  9.0324,  9.0339,  9.0355]), 'temp': array([ 8.74,  8.74,  8.59, ...,  9.67,  9.67,  9.52]), 'time': array([ 733927.62766204,  733927.62905093,  733927.63043981, ...,
        733972.55266204,  733972.55405093,  733972.55543981])}
{'lptemp': array([  9.1326,   9.1325,   9.1324, ...,  11.872 ,  11.872 ,  11.872 ]), 'temp': array([  9.213,   9.163,   9.139, ...,  11.977,  11.953,  11.904]), 'time': array([ 733892.70236111,  733892.70375   ,  733892.70513889, ...,
        733927.47458333,  733927.47597222,  733927.47736111])}

Now let's look at the variables names, ie the mooring deployments, in a bit more detail. We want to know the unique station names/depths, the unique lists of parameters measured, and the unique depths. To do so, we'll use a Python technique called list comprehension that generates a list using a loop in a single line. We'll also use the string method split which generates a list of strings by splitting on a chosen character. Note that we're making lists within lists. Because each of the inner lists has the same dimension, we can wrap this in the numpy function array which generates a 2D ndarray type from lists of lists.


In [7]:
# The split method generates a list for each name, while the list comprehension 
# generates a super-list containing all the sub-lists.
print [v.split('_') for v in variables]  
v_array = np.array([v.split('_') for v in variables])

# Now we find the unique stations, parameters, and depths
parameters = np.unique(v_array[:,0])
stations = np.unique(v_array[:,1])
depths = np.unique(v_array[:,3])

print 'Parameters: ', parameters
print 'Stations: ', stations
print 'Depths: ', depths  # Note that they are strings


[['T', 'CA042', '090603', '035'], ['T', 'CA042', '090603', '030'], ['T', 'CA065', '090603', '050'], ['T', 'CA015', '090429', '015'], ['T', 'CA015', '090429', '010'], ['T', 'CA065', '090603', '010'], ['T', 'CA042', '090728', '010'], ['T', 'CA042', '090429', '010'], ['CTPO', 'CA015', '090429', '001'], ['F', 'CA042', '090429', '038'], ['T', 'CA065', '090429', '040'], ['T', 'CA065', '090910', '030'], ['T', 'CA065', '090728', '040'], ['T', 'CA042', '090603', '020'], ['T', 'CA042', '090728', '001'], ['T', 'CA065', '090603', '020'], ['T', 'CA015', '090429', '001'], ['T', 'CA042', '090429', '001'], ['T', 'CA065', '090429', '020'], ['T', 'CA015', '090603', '015'], ['T', 'CA042', '090902', '001'], ['T', 'CA065', '090429', '050'], ['CTPO', 'CA015', '090728', '001'], ['T', 'CA065', '090910', '020'], ['T', 'CA065', '090728', '050'], ['F', 'CA042', '090728', '038'], ['CT', 'CA042', '090902', '038'], ['T', 'CA015', '090902', '015'], ['T', 'CA015', '090902', '005'], ['T', 'CA042', '090728', '030'], ['T', 'CA042', '090728', '035'], ['T', 'CA065', '090603', '030'], ['T', 'CA015', '090603', '005'], ['T', 'CA065', '090429', '030'], ['T', 'CA015', '090603', '001'], ['CT', 'CA042', '090728', '038'], ['T', 'CA065', '090910', '050'], ['T', 'CA065', '090728', '060'], ['T', 'CA065', '090603', '040'], ['T', 'CA015', '090902', '001'], ['CTPO', 'CA042', '090603', '001'], ['T', 'CA042', '090728', '020'], ['T', 'CA065', '090429', '001'], ['CTPO', 'CA015', '090603', '001'], ['CTPO', 'CA015', '090902', '001'], ['T', 'CA065', '090910', '040'], ['T', 'CA015', '090902', '010'], ['T', 'CA065', '090728', '001'], ['CTPO', 'CA042', '090728', '001'], ['CTPO', 'CA065', '090603', '001'], ['CTPO', 'CA065', '090728', '001'], ['CTPO', 'CA042', '090429', '001'], ['CTPO', 'CA065', '090429', '001'], ['CTPO', 'CA042', '090902', '001'], ['T', 'CA065', '090603', '065'], ['T', 'CA042', '090728', '042'], ['T', 'CA065', '090603', '060'], ['T', 'CA065', '090728', '010'], ['T', 'CA042', '090429', '042'], ['CT', 'CA042', '090603', '038'], ['T', 'CA042', '090902', '030'], ['T', 'CA042', '090902', '035'], ['T', 'CA065', '090429', '010'], ['T', 'CA042', '090902', '042'], ['F', 'CA042', '090902', '038'], ['T', 'CA065', '090910', '065'], ['T', 'CA065', '090910', '060'], ['T', 'CA042', '090603', '010'], ['T', 'CA065', '090728', '020'], ['T', 'CA042', '090429', '035'], ['T', 'CA042', '090429', '030'], ['T', 'CA042', '090902', '020'], ['T', 'CA065', '090429', '065'], ['F', 'CA042', '090603', '038'], ['T', 'CA065', '090910', '010'], ['CT', 'CA042', '090429', '038'], ['T', 'CA015', '090728', '010'], ['T', 'CA015', '090728', '015'], ['T', 'CA042', '090603', '001'], ['T', 'CA042', '090429', '020'], ['T', 'CA065', '090603', '001'], ['T', 'CA042', '090902', '010'], ['T', 'CA065', '090429', '060'], ['T', 'CA065', '090910', '001'], ['CTPO', 'CA065', '090902', '001'], ['T', 'CA015', '090728', '001'], ['T', 'CA015', '090728', '005']]
Parameters:  ['CT' 'CTPO' 'F' 'T']
Stations:  ['CA015' 'CA042' 'CA065']
Depths:  ['001' '005' '010' '015' '020' '030' '035' '038' '040' '042' '050' '060'
 '065']

Assigning the data to dictionaries.

We're now going to re-assign the data to nested dictionaries based on the variable (ie mooring deployment) names. While this isn't strictly necessary, it may help with looping later on and gives us a chance to demonstrate some Pythonic code.

The new nested dictionaries will look like matdata2[station][param][depth][date][data] where each set of brackets is a new nested dictionary.


In [8]:
# First, we'll combine the data sets for each unique combination of instruments, station, and depth.
matdata2 = {}  # Creates an empty dictionary

for k,v in matdata.iteritems():  # Here's way to loop over key,value pairs of dictionaries.
    if  k.startswith('__'):  # Skip the special keys
        continue             # Continue goes to the next loop iteration.
    param, station, date, depth = k.split('_')  # This splits up the string, assigning each of the parts to different variables.
    
    # The if statements here just check to make sure that level of the dictionary has already been created.
    # If all levels are present, we can just assign a new key.  If not, we have to create a new dictionary
    # at that level.
    if station in matdata2:
        if param in matdata2[station]:
            if depth in matdata2[station][param]:
                matdata2[station][param][depth][date] = v  # We don't need to check depth b/c we're assigning it.
            else:
                matdata2[station][param][depth] = {date: v}
        else:
            matdata2[station][param] = {depth: {date: v}}
    else:
        matdata2[station] = {param: {depth: {date: v}}}

matdata2 now has the data sorted as nested dictionaries as opposed to a single, flat set of data. This (can) make looping over related data easier.

Combining dates into a single vector

First, we'll combine our data into a single vector by concatenating the different deployment dates. We'll add a np.nan between each segment so that it's discontinuous when plotting. We'll generate a list of arrays and then use np.hstack to concatenate all the lists into a single array.


In [9]:
matdata3 = deepcopy(matdata2)

for station in matdata3.itervalues():  # Loops over values (in this case, the sub-dictionaries)
    for params in station.itervalues():  # OK, the param dict is the one we're going to modify
        for depth in params.itervalues():
            for date in sorted(depth.keys()):  # Sort so that the dates order properly
                for v in depth[date].keys():
                    if v not in depth:
                        depth[v] = [depth[date].pop(v)]  # Initialize the list with the first value
                    else:
                        depth[v].append(np.array(np.nan)) # Put nans between lists to break-up timeseries
                        depth[v].append(depth[date].pop(v))
                depth.pop(date)
            
            # Now we have to convert our lists of ndarrays into a single array
            for v in depth.keys():  # depths no longer contains the dates, but the variables now
                depth[v] = np.hstack(tuple(depth[v])) # hstack expects a tuple, not a list

Interpolating the depths to a common time

At this point we have dictionaries nested three-layers deep. Station->Parameter Set -> Depth -> time, temp, lptemp, etc. Now we have to make some decisions based on what we actually want to do with the data. For this example, we're going to combine the data from the different depths in single 2-D arrays for each parameter set. We really only need to concern ourselves with the temperature data at this point, as it's the only instrument located at various points along the mooring.

Note the the times are different for each instrument located along a mooring line. To combine them into a single 2D-array, we'll need to do some interpolation. To do that, we need to know what our times look like. We'll print out the start and end dates of each station, along with the different time-intervals for measurements.


In [10]:
from matplotlib.dates import num2date   # This converts matplotlib date-nums to Python date objects

for name, station in matdata3.iteritems():
    T = station['T']   # Just get the temperatures
    station_mint = np.inf
    station_maxt = -np.inf
    print name + ': '
    
    # Loop through depths find start/stop times and mean dt
    for depth in sorted(T.keys()):
        time = T[depth]['time']
        mint = np.floor(np.nanmin(time))
        maxt = np.ceil(np.nanmax(time))
        station_mint = min(mint,station_mint)  # Note difference in min functions (Also, np.fmin)
        station_maxt = max(maxt,station_maxt)  # Note difference in min function
        
        dt   = np.nanmean(np.diff(time)) *24*60  # Convert days to minutes
        print u'\t {}: {} to {}.  ∆t = {:f} min'.format(depth, num2date(mint).strftime('%Y-%m-%d'),
                                                               num2date(maxt).strftime('%Y-%m-%d'), dt)

    print '{}: {} to {}'.format(name, num2date(station_mint).strftime('%Y-%m-%d'), 
                                      num2date(station_maxt).strftime('%Y-%m-%d'))


CA065: 
	 001: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 010: 2010-04-30 to 2010-09-22.  ∆t = 1.853680 min
	 020: 2010-04-30 to 2010-09-22.  ∆t = 1.853608 min
	 030: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 040: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 050: 2010-04-30 to 2010-09-22.  ∆t = 1.853621 min
	 060: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 065: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
CA065: 2010-04-30 to 2010-10-09
CA042: 
	 001: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 010: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 020: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 030: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 035: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 042: 2010-04-30 to 2010-09-22.  ∆t = 1.327393 min
CA042: 2010-04-30 to 2010-10-09
CA015: 
	 001: 2010-04-30 to 2010-10-09.  ∆t = 2.000000 min
	 005: 2010-06-04 to 2010-10-09.  ∆t = 2.000000 min
	 010: 2010-04-30 to 2010-10-09.  ∆t = 1.663187 min
	 015: 2010-04-30 to 2010-09-22.  ∆t = 1.764856 min
CA015: 2010-04-30 to 2010-10-09

Now we interpolate!

Ok, it looks like our data falls between April 30 and Oct 10, 2010 with a roughly 2 min period. Lets' use this to create a new time variable and interpolate all our data to it.


In [11]:
# Finally, we'll interpolate to a common timestep so that each station and depth can use the same timestep
matdata4 = deepcopy(matdata3)
from matplotlib.dates import datestr2num

# Create the time array
time =np.arange(datestr2num('2010-04-30'), datestr2num('2010-10-09'), 2.0/24./60.)  # 2 minute intervals
nt = len(time)

for key, station in matdata4.iteritems():
    T = station['T']  # Only looping through temps for example
    depths = sorted(T.keys())
    nd = len(depths)
    data = np.empty((nd,nt))  # Pre-allocate space for the array.  Note the double (()) - the argument is a tuple!
    lpdata = np.ones((nd,nt))  # Pre-allocate, slightly slower.

    # Note that in numpy, interp is (xi, x, y) vs. matlab (x, y, xi) 
    # where x,y is the original and xi is where to interpolate the y to.
    for i,d in enumerate(depths):
        data[i,:] = np.interp(time, T[d]['time'], T[d]['temp'])
        lpdata[i] = np.interp(time, T[d]['time'], T[d]['lptemp'])  # NB: for 2D arrays: [i]  = [i][:] = [i,:]
    
    # This is one way to convert a list of strings to floating-point array
    depth_as_floats = np.array([float(i) for i in depths])

    # Now we reassign the data
    matdata4[key] = {'time':time, 'temp':data, 'lptemp':lpdata, 'depth':depth_as_floats}

Creating a simple plot

Finally, we'll create a simple plot to see what the results look like.


In [12]:
from matplotlib.dates import MonthLocator

f,ax = plt.subplots(3,1)
f.set_size_inches(11,8.5)

for a, s in zip(ax, sorted(matdata4.keys())):
    a.plot(matdata4[s]['time'], matdata4[s]['lptemp'].T,'-')
    a.set_ylabel('T [$^\circ$C]')
    a.set_title(s, loc='left')
    # This sticks the legend outside of the axes. Note that if you don't adjust
    # the axis width, the legend will be 'off the page'.
    L =a.legend(matdata4[s]['depth'], bbox_to_anchor=(1.05, 1), loc=2) 
    L.set_title('Depth [m]')
    f.subplots_adjust(right=0.8) # adjust axis width
    
ax[-1].xaxis_date()
for a in ax[:-1]:
    a.set_xticks(ax[-1].get_xticks())
    a.set_xticklabels([])