Reading CTD data with PySeabird

Author: Guilherme Castelão

pySeabird is a package to parse/load CTD data files. It should be an easy task but the problem is that the format have been changing along the time. Work with multiple ships/cruises data requires first to understand each file, to normalize it into a common format for only than start your analysis. That can still be done with few general regular expression rules, but I would rather use strict rules. If I'm loading hundreds or thousands of profiles, I want to be sure that no mistake passed by. I rather ignore a file in doubt and warn it, than belive that it was loaded right and be part of my analysis.

With that in mind, I wrote this package with the ability to load multiple rules, so new rules can be added without change the main engine.

For more information, check the documentatio


In [1]:
%matplotlib inline

from seabird.cnv import fCNV

Let's first download an example file with some CTD data


In [2]:
!wget https://raw.githubusercontent.com/castelao/seabird/master/sampledata/CTD/dPIRX003.cnv


--2018-11-12 14:00:41--  https://raw.githubusercontent.com/castelao/seabird/master/sampledata/CTD/dPIRX003.cnv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.136.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.136.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47291 (46K) [text/plain]
Saving to: ‘dPIRX003.cnv’

dPIRX003.cnv        100%[===================>]  46.18K  --.-KB/s    in 0.01s   

2018-11-12 14:00:41 (4.33 MB/s) - ‘dPIRX003.cnv’ saved [47291/47291]


In [3]:
profile = fCNV('dPIRX003.cnv')

The profile dPIRX003.cnv.OK was loaded with the default rule cnv.yaml

The header (metadata)

The header is loaded into the .attributes as a dictionary. Note that the date was already converted into a datetime object.

There is a new attribute, not found in the file, that is 'md5'. This is the MD5 Hash for the original file. This might be usefull to double check the inputs when reproducing some analysis.

Since it's a dictionary, to extract the geographical coordinates, for example:


In [4]:
print ("The profile coordinates is latitude: %.4f, and longitude: %.4f" % \
    (profile.attributes['LATITUDE'], profile.attributes['LONGITUDE']))


The profile coordinates is latitude: 12.6743, and longitude: -38.0018

Or for an overview of all the attributes and data:


In [5]:
print("Header: %s" % profile.attributes.keys())
print(profile.attributes)


Header: dict_keys(['sbe_model', 'seasave', 'instrument_type', 'nquan', 'nvalues', 'start_time', 'bad_flag', 'file_type', 'md5', 'datetime', 'LATITUDE', 'LONGITUDE', 'filename'])
{'sbe_model': '9', 'seasave': 'Win32 V 5.37d', 'instrument_type': 'CTD', 'nquan': '11', 'nvalues': '349', 'start_time': 'Apr 02 2008 18:52:30', 'bad_flag': '-9.990e-29', 'file_type': 'ascii', 'md5': '1ad70243bdea4bfd4c6f60ca7141bf2b', 'datetime': datetime.datetime(2008, 4, 2, 18, 52, 30), 'LATITUDE': 12.674333333333333, 'LONGITUDE': -38.00183333333333, 'filename': 'dPIRX003.cnv'}

The data

The object profile behaves like a dictionary with the data. So to check the available data one can just


In [6]:
print(profile.keys())


['timeS', 'PRES', 'TEMP', 'TEMP2', 'CNDC', 'CNDC2', 'potemperature', 'potemperature2', 'PSAL', 'PSAL2', 'flag']

Each data returns as a masked array, hence all values equal to profile.attributes['bad_flag'] will return as a masked value


In [7]:
profile['TEMP2'][:25]


Out[7]:
masked_array(data=[15.7969, 15.5144, 15.4179, 15.3232, 15.1983, 15.1154,
                   15.076, 15.0842, 15.0535, 15.0212, 14.9611, 14.8273,
                   14.7881, 14.7453, 14.7223, 14.7371, 14.7438, 14.7413,
                   14.7404, --, 14.4535, 14.3612, 14.2561, 14.2155,
                   14.2098],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False,  True, False, False, False, False,
                   False],
       fill_value=-9.99e-29)

As a regular masked array, let's check the mean and standard deviation between the two temperature sensors


In [8]:
print(profile['TEMP'].mean(), profile['TEMP'].std())
print(profile['TEMP2'].mean(), profile['TEMP2'].std())


11.580700613496932 1.5675053880983678
11.540421100917431 1.6110723038427497

In [9]:
from matplotlib import pyplot as plt

plt.plot(profile['TEMP'], profile['PRES'],'b')
plt.plot(profile['TEMP2'], profile['PRES'],'g')
plt.gca().invert_yaxis()
plt.xlabel('temperature')
plt.ylabel('pressure [dbar]')
plt.title(profile.attributes['filename'])


Out[9]:
Text(0.5, 1.0, 'dPIRX003.cnv')

We can also export the data into a pandas DataFrame for easier data manipulation later on:


In [10]:
df = profile.as_DataFrame()
df.head()


Out[10]:
timeS PRES TEMP TEMP2 CNDC CNDC2 potemperature potemperature2 PSAL PSAL2 flag LATITUDE LONGITUDE
0 246.446 162.0 15.8635 15.7969 4.468541 4.497739 15.8372 15.7713 35.7561 36.0788 0.0 12.674333 -38.001833
1 249.417 163.0 15.4798 15.5144 4.459094 4.460689 15.4545 15.4889 36.0197 36.0024 0.0 12.674333 -38.001833
2 250.462 164.0 15.3983 15.4179 4.444983 4.448821 15.3729 15.3924 35.9660 35.9829 0.0 12.674333 -38.001833
3 251.428 165.0 15.2930 15.3232 4.432046 4.436992 15.2675 15.2977 35.9446 35.9618 0.0 12.674333 -38.001833
4 252.285 166.0 15.1721 15.1983 4.417620 4.422013 15.1466 15.1728 35.9237 35.9397 0.0 12.674333 -38.001833