What is NetCDF?

NetCDF stands for Network Common Data Form. It is a "set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data". In short, NetCDF is a convenient way to store your data. The software is maintained by developers at Unidata, which is part of UCAR.

NetCDF has been around for quite some time now and has gained a lot of popularity within the geoscience community. It is a common data storage format for oceanographic and atmospheric data, and is quickly becoming the standard. Almost all online databases, in the geosciences at least, provide their data as netCDF files. So, as a researcher in our field, learning how to work with NetCDF is almost a necessity.

Using NetCDF in Python

There are currently several versions of NetCDF in use: netCDF classic, netCDF 64-bit and netCDF-4. netCDF classic is the original format and probably the most commonly used version, but it can only store up to 2 gigabytes of data. netCDF 64-bit provides expanded data storage capabilities. netCDF-4 is the latest generation of NetCDF. It has the data storage capabilities of netCDF 64-bit and borrows features from HDF5. You can learn more about each version here.
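
One way to tell these versions apart is to look at the first few bytes of a file. The sketch below assumes the signatures given in the netCDF file format specification: classic files start with `CDF\x01`, 64-bit offset files with `CDF\x02`, and netCDF-4 files, being HDF5 files, start with the HDF5 signature. The names `sniff_netcdf_format` and `sniff_file` are hypothetical helpers written for this illustration, not part of any library, and the check assumes the HDF5 signature sits at the very start of the file.

```python
# Hypothetical helper: guess the netCDF version from the leading bytes of a file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def sniff_netcdf_format(header):
    """Return a label for the first 8 bytes of a file, or None if unknown."""
    if header.startswith(b"CDF\x01"):
        return "netCDF classic"
    if header.startswith(b"CDF\x02"):
        return "netCDF 64-bit offset"
    if header.startswith(HDF5_SIGNATURE):
        # netCDF-4 files are HDF5 files, so this only identifies the container
        return "netCDF-4 (HDF5)"
    return None

def sniff_file(path):
    """Read the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_netcdf_format(fh.read(8))
```

Calling `sniff_file('sample.nc')` on a file written with `format='NETCDF4'` should report the HDF5 container, since that is what netCDF-4 uses underneath.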

The scipy package includes a module to read and create netCDF files. However, this module only supports the older netCDF 3 formats (classic and 64-bit offset) and cannot work with netCDF-4 files. The netCDF4 module can handle all netCDF versions. For this reason, the rest of this notebook will discuss the netCDF4 module.

You can get the netCDF4 module from the Python Package Index (PyPI) by typing pip install netCDF4 in a terminal shell. If you are using a Python distribution such as Anaconda or Canopy, you may already have the module installed.
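
To check whether the module is already available before installing anything, something along these lines should work. This is a minimal sketch that assumes Python 3.4 or later, where `importlib.util.find_spec` exists; the variable name `has_netcdf4` is only illustrative.

```python
import importlib.util

# find_spec returns None when the module cannot be found on the import path
spec = importlib.util.find_spec("netCDF4")
has_netcdf4 = spec is not None

print("netCDF4 installed: %s" % has_netcdf4)
```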

Creating a simple NetCDF dataset

To create a netCDF dataset, you call the Dataset constructor. Here I also create a group inside the dataset to hold the temperature data:


In [4]:
import netCDF4 as nc4

f = nc4.Dataset('sample.nc','w', format='NETCDF4') #'w' stands for write
tempgrp = f.createGroup('Temp_data')

The code above creates a netCDF file called "sample.nc" in your current folder, along with a group called "Temp_data" inside it. f is a netCDF Dataset object that provides methods for storing variables and other information. Let's create some random data:


In [5]:
import numpy as np

lon = np.arange(45,101,2)
lat = np.arange(-30,25,2.5)
z = np.arange(0,200,10)
x = np.random.randint(10,25, size=(len(lon), len(lat), len(z)))
noise = np.random.rand(len(lon), len(lat), len(z))
temp_data = x+noise

print("shape of data: ")
print(temp_data.shape)


shape of data: 
(28, 22, 20)

I have just created some fake temperature data on a grid of longitudes, latitudes and depth levels. The shape of the data array is (28,22,20), representing (lon, lat, z). For concreteness, let's assume that this represents data for one day.
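
The lengths 28, 22 and 20 follow from how np.arange works: the stop value is excluded. The short check below rebuilds the same three coordinate arrays and prints their lengths and last values. It is only a sanity check on the arrays defined above, not part of the workflow.

```python
import numpy as np

lon = np.arange(45, 101, 2)     # 45, 47, ..., 99
lat = np.arange(-30, 25, 2.5)   # -30, -27.5, ..., 22.5
z = np.arange(0, 200, 10)       # 0, 10, ..., 190

# np.arange stops before the stop value, so 101, 25 and 200 never appear
print((len(lon), len(lat), len(z)))
print((lon[-1], lat[-1], z[-1]))
```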

The next step is to specify the dimensions that describe my data:


In [6]:
#if you run this cell more than once it will throw an error
tempgrp.createDimension('lon', len(lon))
tempgrp.createDimension('lat', len(lat))
tempgrp.createDimension('z', len(z))
tempgrp.createDimension('time', None)


Out[6]:
<netCDF4.Dimension at 0x10c80e280>

Here I used the createDimension method. The first argument is a string specifying the name of the dimension; the second is an integer that specifies its length. In the last line, by using None, I have made time an unlimited dimension. This is in anticipation of receiving more data in the future. By making the time dimension unlimited, I can append data to that dimension indefinitely.

Next, I create netCDF variables using the dimensions I created:


In [7]:
longitude = tempgrp.createVariable('Longitude', 'f4', 'lon')
latitude = tempgrp.createVariable('Latitude', 'f4', 'lat')
levels = tempgrp.createVariable('Levels', 'i4', 'z')
temp = tempgrp.createVariable('Temperature', 'f4', ('time', 'lon', 'lat', 'z'), zlib=True)
time = tempgrp.createVariable('Time', 'i4', 'time')
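
The type codes 'f4' and 'i4' used above are NumPy-style dtype strings: a letter for the kind of number followed by the size in bytes. The quick check below, which only needs NumPy, shows what they correspond to.

```python
import numpy as np

# 'f4' is a 4-byte float, 'i4' is a 4-byte signed integer
print(np.dtype("f4"))
print(np.dtype("i4"))

# bytes used per stored value
print((np.dtype("f4").itemsize, np.dtype("i4").itemsize))
```

Because temp_data is a 64-bit float array, storing it in an 'f4' variable keeps roughly seven significant digits per value.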

Let's look at what we have done so far:


In [8]:
print(f)


<type 'netCDF4.Dataset'>
root group (NETCDF4 data model, file format UNDEFINED):
    dimensions(sizes): 
    variables(dimensions): 
    groups: Temp_data


In [9]:
print(f.groups['Temp_data']) #same as print(tempgrp)


<type 'netCDF4.Group'>
group /Temp_data:
    dimensions(sizes): lon(28), lat(22), z(20), time(0)
    variables(dimensions): float32 Longitude(lon), float32 Latitude(lat), int32 Levels(z), float32 Temperature(time,lon,lat,z), int32 Time(time)
    groups: 


In [10]:
tempgrp.variables.keys()


Out[10]:
['Longitude', 'Latitude', 'Levels', 'Temperature', 'Time']

In [11]:
tempgrp.dimensions.keys()


Out[11]:
['lon', 'lat', 'z', 'time']

In [12]:
#write the data into the netCDF variables
longitude[:] = lon
latitude[:] = lat
levels[:] = z
temp[0,:,:,:] = temp_data

#get time in days since Jan 01,01
from datetime import datetime
today = datetime.today()
time_num = today.toordinal()
time[0] = time_num
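
The time stamp stored above is a plain integer, so it helps to know how to turn it back into a date. The sketch below uses only the standard library. Note that toordinal drops the time of day, and that it counts 1 January of year 1 as day 1 rather than day 0.

```python
from datetime import date, datetime

today = datetime.today()
time_num = today.toordinal()    # integer day count, time of day is dropped

# fromordinal is the inverse of toordinal and gives back the calendar date
recovered = date.fromordinal(time_num)
print(recovered == today.date())

# 1 January of year 1 has ordinal 1, not 0
print(date(1, 1, 1).toordinal())
```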

NetCDF attributes provide additional information about the dataset (i.e., metadata). You can add attributes to variables, groups and the dataset itself. Let's do that:


In [13]:
#Add global attributes
f.description = "Example dataset containing one group"
f.history = "Created " + today.strftime("%d/%m/%y")

#Add local attributes to variable instances
longitude.units = 'degrees east'
latitude.units = 'degrees north'
time.units = 'days since Jan 01, 0001'
temp.units = 'degrees K'
levels.units = 'meters'
temp.warning = 'This data is not real!'

In [14]:
f.close() #closing the file writes everything to disk

In [15]:
f = nc4.Dataset('sample.nc','r') #'r' stands for read
tempgrp = f.groups['Temp_data']

In [16]:
print("meta data for the dataset:")
print(f)
print("meta data for the Temp_data group:")
print(tempgrp)
print("meta data for Temperature variable:")
print(tempgrp.variables['Temperature'])


meta data for the dataset:
<type 'netCDF4.Dataset'>
root group (NETCDF4 data model, file format UNDEFINED):
    description: Example dataset containing one group
    history: Created 06/12/14
    dimensions(sizes): 
    variables(dimensions): 
    groups: Temp_data

meta data for the Temp_data group:
<type 'netCDF4.Group'>
group /Temp_data:
    dimensions(sizes): lon(28), lat(22), z(20), time(1)
    variables(dimensions): float32 Longitude(lon), float32 Latitude(lat), int32 Levels(z), float32 Temperature(time,lon,lat,z), int32 Time(time)
    groups: 

meta data for Temperature variable:
<type 'netCDF4.Variable'>
float32 Temperature(time, lon, lat, z)
    units: degrees K
    warning: This data is not real!
path = /Temp_data
unlimited dimensions: time
current shape = (1, 28, 22, 20)
filling on, default _FillValue of 9.96920996839e+36 used


In [22]:
tempgrp.variables.keys()


Out[22]:
[u'Longitude', u'Latitude', u'Levels', u'Temperature', u'Time']

In [24]:
time_vble = tempgrp.variables['Time']
print(time_vble.ncattrs())
print(time_vble.getncattr('units'))


[u'units']
days since Jan 01, 0001

In [29]:
temp_vble = tempgrp.variables['Temperature']
for attr in temp_vble.ncattrs():
    print(attr + ': ' + str(temp_vble.getncattr(attr)))


units: degrees K
warning: This data is not real!

In [24]:
zlvls = tempgrp.variables['Levels'][:]
print(zlvls)


[  0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170
 180 190]

In [26]:
temp_0z = tempgrp.variables['Temperature'][0,:,:,0]
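
The indexing above follows NumPy slicing rules: an integer index drops that dimension and a colon keeps it. To show the resulting shapes without needing the file, the sketch below uses a NumPy array of zeros as a stand-in for the Temperature variable. The stand-in array and the names `surface` and `profile` are assumptions made for the illustration.

```python
import numpy as np

# Stand-in with the same layout as Temperature: (time, lon, lat, z)
temp = np.zeros((1, 28, 22, 20), dtype="f4")

surface = temp[0, :, :, 0]    # first time step, first depth level
profile = temp[0, 2, 2, :]    # every depth level at one (lon, lat) point

print((surface.shape, profile.shape))
```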

In [27]:
print("%.20f" % temp_0z[2,2])


10.75250816345214843750
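
The long tail of digits above comes from the 'f4' type of the Temperature variable: the value that comes back is the nearest 32-bit float to what was written, not the original 64-bit number. The sketch below illustrates the same effect with NumPy alone. The number 10.7525 is an arbitrary example, not a value taken from the dataset.

```python
import numpy as np

value = 10.7525081634521484375    # the number printed above
as_f4 = np.float32(value)

# This number is exactly representable as a 32-bit float
print(float(as_f4) == value)

# A typical 64-bit float is not, so it is rounded when stored as 'f4'
original = 10.7525
stored = np.float32(original)
print("%.20f" % original)
print("%.20f" % stored)
```

If the full precision matters, the variable could be created with the 'f8' type code instead, at the cost of twice the storage.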
