This notebook shows an example of reading a Matlab `.mat` file, converting the data into a usable dictionary with loops, and making a simple plot of the data.
First, we open our `.mat` file using `loadmat` from `scipy.io`. This function returns the variables in the `.mat` file as keys in a dictionary. Note that this function can only open `.mat` files saved with a version less than 7.3. If the version of your file is 7.3 or above, you have two options:
1. Open the file in Matlab and save it as a lower version: `save filename.mat -v7`
2. Read it with the `h5py` package (see the sketch just below): from v7.3 on, `.mat` files are saved as HDF5 files, a format that is a super-set of netCDF (version 4).
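Our example file is an older version, so we won't need option 2 here, but as a rough sketch (the file and variable names below are hypothetical), reading a v7.3 file with `h5py` looks something like:

    import h5py

    f = h5py.File('filename.mat', 'r')  # a v7.3 .mat file is a valid HDF5 file
    print f.keys()                      # the top-level variable names
    x = f['x'][:]                       # 'x' is a hypothetical variable; [:] reads it into a numpy array
    f.close()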
The example file we are reading has a complex data structure. Originally the file contained many additional variables, corresponding to other mooring lines, but we have reduced it to only contain data for a single set of moorings. The same techniques could be used with a loop to read the entire file.
Python has two data types similar to Matlab's `struct`: the standard Python `dict`, and record arrays or structured arrays from `numpy`. I usually use `dict`, as it is part of the standard library, there is lots of information about it online, and it is relatively simple to access. `numpy`'s record arrays are more similar to Matlab's `struct` arrays in that they can be any shape. Because of this they are more powerful than `dict`, but also a bit more confusing to access and use.
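To make the difference concrete, here's a toy example (made-up values, not from our data file) storing the same two fields both ways:

    import numpy as np

    # A dict: values are accessed by key; the dict itself has no shape.
    d = {'time': np.arange(3), 'temp': np.array([10., 11., 12.])}
    print d['temp']

    # A structured array: fields are accessed by name, but the array also
    # has a shape and can be indexed by position like any other ndarray.
    ra = np.zeros(3, dtype=[('time', 'f8'), ('temp', 'f8')])
    ra['temp'] = [10., 11., 12.]
    print ra['temp']
    print ra[0]  # a single record holding both fields -- a dict has no equivalent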
To start, we'll read in the file and print the keys of the dictionary the file was loaded into. These keys correspond to the variables stored in the original `.mat` file.
In [1]:
%matplotlib inline
import numpy as np
import scipy.io as spio
import matplotlib.pyplot as plt
matfile = '../../data/CA2009.mat'
matdata = spio.loadmat(matfile)
print matdata.keys()
For this data set, each variable name is composed of four parts separated by underscores: INST_STA_DATE_DEPTH (instrument, station, deployment date, and depth).
Also, note that there are keys like `__globals__`, `__version__`, and `__header__`. These keys contain metadata about the `.mat` file but are generally not used, so we'll need a way to ignore them later.
To get a better look at the data, we're going to write a small function that recursively goes into every key of a dictionary or every field of a structured array and prints the data, indenting so that we know what level we're at.
In [2]:
def print_mat_nested(d, indent=0, nkeys=0):
    """Pretty print nested structures from .mat files.

    Inspired by: `StackOverflow <http://stackoverflow.com/questions/3229419/pretty-printing-nested-dictionaries-in-python>`_
    """
    # Subset the dictionary to limit the keys to print. Only works on the first level.
    if nkeys > 0:
        d = {k: d[k] for k in d.keys()[:nkeys]}  # Dictionary comprehension: limit to the first nkeys keys.

    if isinstance(d, dict):
        for key, value in d.iteritems():  # iteritems loops through key, value pairs
            print '\t' * indent + 'Key: ' + str(key)
            print_mat_nested(value, indent + 1)

    if isinstance(d, np.ndarray) and d.dtype.names is not None:  # Note: `and` short-circuits by default
        for n in d.dtype.names:  # If dtype.names is set, it's a struct; this is a bit of a kludge test.
            print '\t' * indent + 'Field: ' + str(n)
            print_mat_nested(d[n], indent + 1)

print_mat_nested(matdata, nkeys=10)
OK, so each key has two fields: `data` and `lpdata`. That seems reasonable enough. The original file had the same format, with each variable being a struct with these two fields. In addition, each of these fields was a nested struct containing fields like `time`, `temp`, etc.
It's a bit odd that our function didn't drill down further. It should have kept going through each level of nesting as long as the data at each level was a structured array or a dict. Instead, it stopped before getting to the `time` or `temp` fields.

Let's take a look at the fields of those structured arrays.
In [3]:
t = matdata['T_CA065_090603_010']  # Just look at one set of data

# Define a simple function to print the data type and shape; it saves us some repetition.
def dtype_shape_str(x):
    """Return a string containing the dtype and shape of x."""
    return str(x.dtype) + " " + str(x.shape)

print "t['data'] is a: " + dtype_shape_str(t['data'])
print "t['data'][0,0] is a: " + dtype_shape_str(t['data'][0,0])
print "t['data'][0,0]['time'] is a: " + dtype_shape_str(t['data'][0,0]['time'])
print "t['data'][0,0]['time'][0,0] is a: " + dtype_shape_str(t['data'][0,0]['time'][0,0])
It turns out the `data` field of our variable is of type `object`. That's a bit odd... Looking closer, it is 2-D with shape (1,1), so it's essentially a scalar. The actual data is nested even deeper, inside the `time` field, which is also a (1,1) structured array.
Let's just do a quick plot of the data to see what we have:
In [4]:
time = t['data'][0,0]['time'][0,0]
temp = t['data'][0,0]['temp'][0,0]
plt.plot(time,temp,'ko')
plt.xlabel('time')
plt.ylabel('temp')
Out[4]: [scatter plot of temp vs. time]
Great! That looks like temperature data. But why do we need all the extra indexing? This is the difference between arrays of structs and structs of arrays. A similar situation happens in Matlab, but if your struct is scalar, you don't need the extra [0,0] indices. (NB: the `scipy.io.loadmat` function has a `squeeze_me` argument which, in theory, should reduce these sub-components to scalars. It seemed to work, in that I got scalars, but for some reason I couldn't access the results.)
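For reference, the call is simply the following (a sketch using the same file and the `dtype_shape_str` helper from above; as noted, your mileage with the result may vary):

    matdata_sq = spio.loadmat(matfile, squeeze_me=True)
    t_sq = matdata_sq['T_CA065_090603_010']
    # The (1,1) wrapper dimensions should be squeezed down to scalars:
    print "t_sq['data'] is a: " + dtype_shape_str(t_sq['data'])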
While arrays of structs can be powerful, it's often the case that datasets just have nested scalar structures, where the actual vector or array data sits at the base of the tree. In these cases, the extra [0,0] indexing is annoying and confusing. Fortunately, someone has already put together a function that breaks this complex nest of data structures into something manageable. The function below loads a `.mat` file and converts the nested scalar structures into nested dictionaries. If you aren't using arrays of structs, this version should work just fine for you.
In [5]:
def loadmat(filename):
    '''
    This function should be called instead of spio.loadmat directly,
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function _check_keys to cure all entries
    which are still mat-objects.

    From: `StackOverflow <http://stackoverflow.com/questions/7008608/scipy-io-loadmat-nested-structures-i-e-dictionaries>`_
    '''
    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)

def _check_keys(d):
    '''
    Checks whether entries in a dictionary are mat-objects. If yes,
    _todict is called to change them to nested dictionaries.
    '''
    for key in d:
        if isinstance(d[key], spio.matlab.mio5_params.mat_struct):
            d[key] = _todict(d[key])
    return d

def _todict(matobj):
    '''
    A recursive function which constructs nested dictionaries from mat-objects.
    '''
    d = {}
    for strg in matobj._fieldnames:
        elem = matobj.__dict__[strg]
        if isinstance(elem, spio.matlab.mio5_params.mat_struct):
            d[strg] = _todict(elem)
        else:
            d[strg] = elem
    return d

matdata = loadmat(matfile)
print_mat_nested(matdata, nkeys=6)
Now that's more like it!
The next step is to combine that data into useful structures. In the original `.mat` file, the variable names (now keys in `matdata`) correspond to data from a particular instrument on a particular mooring, as explained above. We can regroup this data using Python dictionaries into a form that might be a bit more useful.
Looking at the output above, there are two keys for each variable: `lpdata` and `data`. `lpdata` contains low-pass filtered data, but uses the same time array as `data`. Let's combine these into one dictionary to reduce the level of nesting a bit.
The function below does this in place, by altering the dictionary itself. Because it's not actually creating new data, it works quickly. Note that before we run it, we copy all the original data using the Python function `deepcopy`, which copies everything; the normal `copy` function only copies things at the top level.
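A quick toy illustration of that difference (made-up data):

    from copy import copy, deepcopy

    a = {'outer': {'inner': 1}}
    b = copy(a)      # shallow copy: b['outer'] is the *same* object as a['outer']
    c = deepcopy(a)  # deep copy: every level is duplicated
    a['outer']['inner'] = 99
    print b['outer']['inner']  # 99 -- the shallow copy shares the nested dict
    print c['outer']['inner']  # 1  -- the deep copy is unaffected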
In [6]:
# Now that the data has a hierarchy, we can start combining things.
def combine_lpdata(d):
    """Combine the low-pass and regular data into a single dictionary.

    The input is the dictionary that contains the 'data' and 'lpdata' keys.
    The function works in-place, but returns a reference to the dictionary as well."""
    # For regular data, just pop it off the sub-dictionary, keeping the same name.
    # The `pop` method returns the value of key `k` and removes it from the dictionary at the same time.
    for k in d['data'].keys():
        d[k] = d['data'].pop(k)
    d.pop('data')
    # For low-passed data, prepend 'lp' to the front of the key.
    for k in d['lpdata'].keys():
        d['lp' + k] = d['lpdata'].pop(k)
    d.pop('lpdata')
    return d

# Test the function
from copy import deepcopy
test = deepcopy(matdata['T_CA042_090603_010'])  # Copy the data for the test.
print 'keys before:', test.keys()
combine_lpdata(test)
print 'keys after:', test.keys()  # Note how we print the same thing, but the keys have changed.

# Now we'll run it on all of matdata.
# The next line ignores the special keys starting with '__'.
# You can't have Matlab variables starting with '__', so this should only remove the extra keys created by loadmat.
variables = [k for k in matdata.keys() if not k.startswith('__')]  # This is a list comprehension.
# `variables` is now just a list, where each element is a string corresponding to a key of the dictionary matdata.
print variables

# We can loop through all of it, running our function
for v in variables:
    combine_lpdata(matdata[v])
# Or we can loop through just part of it to print the results
for v in variables[:4]:
    print matdata[v]
Now let's look at the variable names, i.e., the mooring deployments, in a bit more detail. We want to know the unique station names, the unique lists of parameters measured, and the unique depths. To do so, we'll use a Python technique called a list comprehension, which generates a list using a loop in a single line. We'll also use the string method `split`, which generates a list of strings by splitting on a chosen character. Note that we're making lists within lists; because each of the inner lists has the same length, we can wrap the result in the `numpy` function `array`, which generates a 2-D `ndarray` from lists of lists.
In [7]:
# The split method generates a list for each name, while the list comprehension
# generates a super-list containing all the sub-lists.
print [v.split('_') for v in variables]
v_array = np.array([v.split('_') for v in variables])
# Now we find the unique stations, parameters, and depths
parameters = np.unique(v_array[:,0])
stations = np.unique(v_array[:,1])
depths = np.unique(v_array[:,3])
print 'Parameters: ', parameters
print 'Stations: ', stations
print 'Depths: ', depths # Note that they are strings
We're now going to re-assign the data to nested dictionaries based on the variable (i.e., mooring deployment) names. While this isn't strictly necessary, it may help with looping later on and gives us a chance to demonstrate some Pythonic code.

The new nested dictionaries will look like matdata2[station][param][depth][date][data], where each set of brackets indexes a new nested dictionary.
In [8]:
# First, we'll combine the data sets for each unique combination of instrument, station, and depth.
matdata2 = {}  # Creates an empty dictionary
for k, v in matdata.iteritems():  # Here's a way to loop over the key, value pairs of a dictionary.
    if k.startswith('__'):  # Skip the special keys
        continue  # `continue` jumps to the next loop iteration.
    param, station, date, depth = k.split('_')  # Split up the string, assigning each part to a different variable.
    # The if statements check whether each level of the dictionary has already been created.
    # If all levels are present, we can just assign a new key. If not, we have to create a new
    # dictionary at that level.
    if station in matdata2:
        if param in matdata2[station]:
            if depth in matdata2[station][param]:
                matdata2[station][param][depth][date] = v  # We don't need to check date b/c we're assigning it.
            else:
                matdata2[station][param][depth] = {date: v}
        else:
            matdata2[station][param] = {depth: {date: v}}
    else:
        matdata2[station] = {param: {depth: {date: v}}}
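As an aside, the cascade of `if` statements can be replaced with the dictionary method `setdefault`, which returns the value of a key if it exists and otherwise inserts (and returns) a default. A sketch of the equivalent loop, building a hypothetical `matdata2b`:

    matdata2b = {}
    for k, v in matdata.iteritems():
        if k.startswith('__'):
            continue
        param, station, date, depth = k.split('_')
        # Each setdefault either fetches the existing sub-dictionary or creates an empty one.
        matdata2b.setdefault(station, {}).setdefault(param, {}).setdefault(depth, {})[date] = v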
`matdata2` now has the data sorted as nested dictionaries, as opposed to a single flat set of data. This can make looping over related data easier.
First, we'll combine our data into single vectors by concatenating the different deployment dates. We'll add an `np.nan` between segments so that the line is discontinuous when plotting. We'll generate a list of arrays and then use `np.hstack` to concatenate the list into a single array.
In [9]:
matdata3 = deepcopy(matdata2)
for station in matdata3.itervalues():  # Loops over values (in this case, the sub-dictionaries)
    for params in station.itervalues():  # The param dict is the one we're going to modify
        for depth in params.itervalues():
            for date in sorted(depth.keys()):  # Sort so that the dates stay in order
                for v in depth[date].keys():
                    if v not in depth:
                        depth[v] = [depth[date].pop(v)]  # Initialize the list with the first value
                    else:
                        depth[v].append(np.array(np.nan))  # Put nans between segments to break up the timeseries
                        depth[v].append(depth[date].pop(v))
                depth.pop(date)
            # Now we have to convert our lists of ndarrays into a single array
            for v in depth.keys():  # depth no longer contains the dates, only the variables
                depth[v] = np.hstack(tuple(depth[v]))  # hstack takes a single sequence of arrays
At this point we have dictionaries nested three layers deep: Station -> Parameter Set -> Depth -> time, temp, lptemp, etc. Now we have to make some decisions based on what we actually want to do with the data. For this example, we're going to combine the data from the different depths into a single 2-D array for each parameter set. We really only need to concern ourselves with the temperature data at this point, as it's the only instrument located at multiple points along the mooring.

Note that the times are different for each instrument located along a mooring line. To combine them into a single 2-D array, we'll need to do some interpolation, and to do that, we need to know what our times look like. We'll print out the start and end dates of each station, along with the different measurement time-intervals.
In [10]:
from matplotlib.dates import num2date  # Converts matplotlib date-nums to Python date objects
for name, station in matdata3.iteritems():
    T = station['T']  # Just get the temperatures
    station_mint = np.inf
    station_maxt = -np.inf
    print name + ': '
    # Loop through the depths to find the start/stop times and the mean dt
    for depth in sorted(T.keys()):
        time = T[depth]['time']
        mint = np.floor(np.nanmin(time))
        maxt = np.ceil(np.nanmax(time))
        station_mint = min(mint, station_mint)  # Note the difference between Python's min and np.min (also np.fmin)
        station_maxt = max(maxt, station_maxt)
        dt = np.nanmean(np.diff(time)) * 24 * 60  # Convert days to minutes
        print u'\t {}: {} to {}. ∆t = {:f} min'.format(depth, num2date(mint).strftime('%Y-%m-%d'),
                                                       num2date(maxt).strftime('%Y-%m-%d'), dt)
    print '{}: {} to {}'.format(name, num2date(station_mint).strftime('%Y-%m-%d'),
                                num2date(station_maxt).strftime('%Y-%m-%d'))
In [11]:
# Finally, we'll interpolate to a common timestep so that each station and depth share the same time array
matdata4 = deepcopy(matdata3)
from matplotlib.dates import datestr2num

# Create the time array
time = np.arange(datestr2num('2010-04-30'), datestr2num('2010-10-09'), 2.0/24./60.)  # 2-minute intervals
nt = len(time)
for key, station in matdata4.iteritems():
    T = station['T']  # Only looping through the temperatures for this example
    depths = sorted(T.keys())
    nd = len(depths)
    data = np.empty((nd, nt))  # Pre-allocate space for the array. Note the double (()) - the argument is a tuple!
    lpdata = np.ones((nd, nt))  # Pre-allocate; np.ones is slightly slower than np.empty.
    # Note that in numpy, interp takes (xi, x, y) vs. Matlab's (x, y, xi),
    # where x, y are the original data and xi is where to interpolate y to.
    for i, d in enumerate(depths):
        data[i, :] = np.interp(time, T[d]['time'], T[d]['temp'])
        lpdata[i] = np.interp(time, T[d]['time'], T[d]['lptemp'])  # NB: for 2-D arrays: [i] = [i][:] = [i,:]
    # One way to convert a list of strings to a floating-point array
    depth_as_floats = np.array([float(i) for i in depths])
    # Now we reassign the data
    matdata4[key] = {'time': time, 'temp': data, 'lptemp': lpdata, 'depth': depth_as_floats}
In [12]:
from matplotlib.dates import MonthLocator
f, ax = plt.subplots(3, 1)
f.set_size_inches(11, 8.5)
for a, s in zip(ax, sorted(matdata4.keys())):
    a.plot(matdata4[s]['time'], matdata4[s]['lptemp'].T, '-')
    a.set_ylabel('T [$^\circ$C]')
    a.set_title(s, loc='left')
    # This sticks the legend outside of the axes. Note that if you don't adjust
    # the axis width, the legend will be 'off the page'.
    L = a.legend(matdata4[s]['depth'], bbox_to_anchor=(1.05, 1), loc=2)
    L.set_title('Depth [m]')
f.subplots_adjust(right=0.8)  # adjust the axis width to make room for the legend
ax[-1].xaxis_date()
for a in ax[:-1]:
    a.set_xticks(ax[-1].get_xticks())
    a.set_xticklabels([])