Reading netCDF data

Interactively exploring a netCDF File

Let's explore a netCDF file from the Atlantic Real-Time Ocean Forecast System

First, import the netCDF4 module.


In [ ]:
import netCDF4

Create a netCDF4.Dataset object

  • f is a Dataset object, representing an open netCDF file.
  • Printing the object gives you summary information, similar to ncdump -h.

In [ ]:
f = netCDF4.Dataset('data/rtofs_glo_3dz_f006_6hrly_reg3.nc')
print(f)

Access a netCDF variable

  • Variable objects are stored by name in the variables dict.
  • Printing a variable yields summary info (including all of its attributes).
  • No data is actually read yet (we just have a reference to the variable object and its metadata).

In [ ]:
print(f.variables.keys()) # get all variable names
temp = f.variables['temperature']  # temperature variable
print(temp)

List the Dimensions

  • All variables in a netCDF file have an associated shape, specified by a list of dimensions.
  • Let's list all the dimensions in this netCDF file.
  • Note that the MT dimension is special (unlimited), which means it can be appended to.

In [ ]:
for d in f.dimensions.items():
    print(d)

Each variable has a dimensions and a shape attribute.


In [ ]:
temp.dimensions

In [ ]:
temp.shape

Each dimension typically has a variable associated with it (called a coordinate variable).

  • Coordinate variables are 1D variables that have the same name as dimensions.
  • Coordinate variables and auxiliary coordinate variables (named by the coordinates attribute) locate values in time and space.

In [ ]:
mt = f.variables['MT']
depth = f.variables['Depth']
x,y = f.variables['X'], f.variables['Y']
print(mt)
print(x)

Accessing data from a netCDF variable object

  • netCDF variable objects behave much like numpy arrays.
  • Slicing a netCDF variable object returns a numpy array with the data.
  • Boolean-array and integer-sequence indexing behaves differently for netCDF variables than for numpy arrays. Only 1-d boolean arrays and integer sequences are allowed, and these indices work independently along each dimension (similar to the way vector subscripts work in Fortran).
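
The difference can be seen in plain numpy, without touching any netCDF file: numpy pairs integer index arrays element-wise, while a netCDF variable applies each index sequence independently along its dimension, which corresponds to numpy's np.ix_ helper.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows, cols = [0, 2], [1, 3]

# numpy fancy indexing pairs the index arrays element-wise:
print(a[rows, cols])          # [ 1 11]  i.e. a[0,1] and a[2,3]

# netCDF variable indexing applies each sequence independently along
# its own dimension (Fortran-style vector subscripts); in numpy this
# is the outer-product indexing of np.ix_:
print(a[np.ix_(rows, cols)])  # [[ 1  3]
                              #  [ 9 11]]
```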

In [ ]:
time = mt[:]  # Reads the netCDF variable MT, an array of one element
print(time)

In [ ]:
dpth = depth[:] # examine depth array
print(dpth)

In [ ]:
xx, yy = x[:], y[:]
print('shape of temp variable:', temp.shape)
# temp's dimensions are (MT, Depth, Y, X), so the Y condition comes before the X condition
tempslice = temp[0, dpth > 400, yy < yy.max()/2, xx > xx.max()/2]
print('shape of temp slice:', tempslice.shape)

What is the sea surface temperature and salinity at 50N, 140W?

Finding the latitude and longitude indices of 50N, 140W

  • The X and Y dimensions don't look like longitudes and latitudes.
  • Use the auxiliary coordinate variables named in the coordinates variable attribute: Latitude and Longitude.

In [ ]:
lat, lon = f.variables['Latitude'], f.variables['Longitude']
print(lat)

Aha! So we need to find array indices iy and ix such that Latitude[iy, ix] is close to 50.0 and Longitude[iy, ix] is close to -140.0 ...


In [ ]:
latvals = lat[:]; lonvals = lon[:] # extract lat/lon values (in degrees) to numpy arrays
import numpy as np

# A function to find the index of the point closest (in squared distance) to a given lat/lon value.
def getclosest_ij(lats, lons, latpt, lonpt):
    dist_sq = (lats - latpt)**2 + (lons - lonpt)**2  # squared distance of every point on the grid
    minindex_flattened = dist_sq.argmin()  # 1D index of the minimum dist_sq element
    # Convert the 1D index back into a 2D index into the lats/lons arrays
    return np.unravel_index(minindex_flattened, lats.shape)

iy_min, ix_min = getclosest_ij(latvals, lonvals, 50., -140.)

Now we have all the information we need to find our answer.

|----------+--------|
| Variable |  Index |
|----------+--------|
| MT       |      0 |
| Depth    |      0 |
| Y        | iy_min |
| X        | ix_min |
|----------+--------|

What is the sea surface temperature and salinity at the specified point?


In [ ]:
sal = f.variables['salinity']
# Read values out of the netCDF file for temperature and salinity
print(temp[0, 0, iy_min, ix_min], temp.units)
print(sal[0, 0, iy_min, ix_min], sal.units)

Remote data access via OPeNDAP

  • Remote data can be accessed seamlessly with the netcdf4-python API.
  • Access happens via the OPeNDAP (DAP) protocol and DAP servers, such as the THREDDS Data Server (TDS).
  • Many formats, such as GRIB, are supported "under the hood".

The following example showcases some nice netCDF features:

  1. We are seamlessly accessing remote data, from a TDS server.
  2. We are seamlessly accessing GRIB2 data, as if it were netCDF data.
  3. We are generating metadata on-the-fly.
  4. We are using pyUDL to get the URL from a TDS data catalog.

In [ ]:
# import pyUDL
from pyudl.tds import get_latest_dods_url
# specify URL for TDS data catalog
gfs_catalog_url = "http://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_0p5deg/catalog.xml" 
# get latest forecast time available from catalog
latest = get_latest_dods_url(gfs_catalog_url)
print(latest)
# open the remote dataset
gfs = netCDF4.Dataset(latest)

In [ ]:
# Look at metadata for a specific variable
# gfs.variables.keys() will show all available variables.
sfctmp = gfs.variables['Temperature_surface']
print(sfctmp)
for dname in sfctmp.dimensions: # print the dimensions associated with this variable
    print(gfs.variables[dname])

Missing values

  • When the data equal the variable's missing_value attribute somewhere, a masked array is returned.
  • We illustrate this with soil moisture data, which is only defined over land.
  • The white areas on the plot are masked values over water.

In [ ]:
soilmvar = gfs.variables['Volumetric_Soil_Moisture_Content_depth_below_surface_layer']
soilm = soilmvar[0, 0, ::-1, :]  # flip the data in latitude so the Northern Hemisphere is up on the plot
print(soilm.shape, type(soilm), soilmvar.missing_value)
import matplotlib.pyplot as plt
%matplotlib inline
cs = plt.contourf(soilm)

Packed integer data

There is a similar feature for variables with scale_factor and add_offset attributes.

  • Short integer data will automatically be returned as float data, with the scale and offset applied.
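
The arithmetic applied on read can be sketched in plain numpy; the packed values and the scale_factor/add_offset attribute values below are made up for illustration:

```python
import numpy as np

# Made-up packed shorts and packing attributes for illustration.
packed = np.array([0, 100, 200], dtype=np.int16)
scale_factor = 0.01
add_offset = 273.15

# netcdf4-python applies unpacked = packed * scale_factor + add_offset
# automatically when a variable carries these attributes.
unpacked = packed * scale_factor + add_offset
print(unpacked)  # [273.15 274.15 275.15]
```

If you need the raw shorts instead, the automatic conversion can be turned off with var.set_auto_maskandscale(False).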

Dealing with dates and times

  • Time variables usually measure time relative to a fixed date using a certain calendar, with units specified like hours since YYYY-MM-DD hh:mm:ss.
  • num2date and date2num convenience functions provided to convert between these numeric time coordinates and handy python datetime instances.
  • date2index finds the time index corresponding to a datetime instance.

In [ ]:
from netCDF4 import num2date, date2num, date2index
timedim = sfctmp.dimensions[0]
print(timedim)
times = gfs.variables[timedim]
print(times.units, times[:])

In [ ]:
dates = num2date(times[:], times.units)
print([date.strftime('%Y%m%d%H') for date in dates])

Get index associated with a specified date, extract forecast data for that date.


In [ ]:
from datetime import datetime, timedelta
date = datetime.now() + timedelta(days=3)
print(date)
ntime = date2index(date, times, select='nearest')
print(ntime, dates[ntime])

Get temp forecast for Boulder (near 40N, -105W)

  • Use the function getclosest_ij we created before...

In [ ]:
lats, lons = gfs.variables['lat'][:], gfs.variables['lon'][:]
# lats, lons are 1-d. Make them 2-d using numpy.meshgrid.
lons, lats = np.meshgrid(lons, lats)
j, i = getclosest_ij(lats, lons, 40, -105)
fcst_temp = sfctmp[ntime, j, i]
print('Boulder forecast valid at %s UTC = %5.1f %s' % (dates[ntime], fcst_temp, sfctmp.units))

Exercise

Get probability of precip (3 hour accumulation > 0.25mm) at Boulder for tomorrow from SREF.

(catalog URL = 'http://thredds.ucar.edu/thredds/catalog/grib/NCEP/SREF/CONUS_40km/ensprod/catalog.html')

Simple multi-file aggregation

What if you have a bunch of netcdf files, each with data for a different year, and you want to access all the data as if it were in one file?


In [ ]:
!ls -l data/prmsl*nc

MFDataset uses file globbing to patch together all the files into one big Dataset. You can also pass it a list of specific files.

Limitations:

  • It can only aggregate the data along the leftmost dimension of each variable.
  • It only works with NETCDF3- or NETCDF4_CLASSIC-formatted files.
  • It is somewhat slow.

In [ ]:
mf = netCDF4.MFDataset('data/prmsl*nc')
times = mf.variables['time']
dates = num2date(times[:], times.units)
print('starting date =', dates[0])
print('ending date =', dates[-1])
prmsl = mf.variables['prmsl']
print(times.shape, prmsl.dimensions, prmsl.shape)

Closing your netCDF file

It's good practice to close netCDF files, though it isn't strictly necessary when the Dataset is open for read access only.


In [ ]:
f.close()
gfs.close()

That's it!

Now you're ready to start exploring your data interactively.

To be continued with Writing netCDF data ....