This page exists because a snow data product I was trying to use would not georeference in standard off-the-shelf tools:
[fedora@lugosi ~]$ gdalinfo NETCDF:nhtsw100e2_20030107_20030113.nc:merged_snow_cover_extent
Warning 1: dimension #2 (cols) is not a Longitude/X dimension.
Warning 1: dimension #1 (rows) is not a Latitude/Y dimension.
Driver: netCDF/Network Common Data Format
Files: nhtsw100e2_20030107_20030113.nc
Size is 180, 180
Coordinate System is `'
Note that the coordinate system is empty and there are a couple of warnings. Turns out, the same thing fixes both problems. Long story short, the file is missing coordinate variables for the rows and cols dimensions. Now it seems obvious that this is what the warning messages were trying to tell me, but at the outset I was too dense to pick up on it.
To experience the entire thread of discovery, and/or learn how to mess around with netCDF using Python, read on. If you just want to see how to fix it, skip ahead to the section "So why isn't it working?"
This section starts with the basics. Open the file, sniff around to see what's in there, and print out some key bits of information. The very first thing to do is to tell Python what libraries we're going to need.
In [4]:
import mpl_toolkits.basemap.pyproj as pyproj  # pyproj as bundled with basemap
import netCDF4 as nc
import os
import numpy as np
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
Before opening this notebook, I grabbed an example data file from the National Snow and Ice Data Center. (ftp://sidads.colorado.edu/pub/DATASETS/nsidc0531_MEASURES_nhsnow_weekly100/2003/nhtsw100e2_20030107_20030113.nc) This file is supposed to be in what's known as EASE Grid 2.0 format. Unfortunately, no known GIS tools (commercial or open source) read the file. Our objective here is to come up with a method of converting the file to something that the tools can read.
So the very first step is to change to the directory containing the file using os.chdir. It works much as you should expect. Give it the full path to where you want to be.
In [5]:
os.chdir("/home/vagrant/ipynb/")
Now it's time to open the file. When we imported the netCDF4 library above, we told Python to associate it with the abbreviation "nc". To open the file, we use the library's "Dataset" constructor and tell it which filename we're interested in. This looks like the following:
In [36]:
d = nc.Dataset('nhtsw100e2_20030107_20030113.nc')
Now we have a variable "d" that we can use to access the contents of the netCDF file. The NetCDF library is pretty smart. It doesn't try to load the whole file into memory at this point, as many NetCDF files can be pretty large. You can use this "d" variable to look at the descriptive metadata, get a sense for the size of the grids, or eventually, ask for the actual data values. For the moment, let's just do the blindingly simple thing and "print" the variable to the terminal.
In [38]:
print d
This simple command gives us a lot of information. This particular file has a lot of metadata, which is all printed out. Some files will have more and some will have less. For the moment, just notice that the "Conventions" attribute specifies "CF-1.6". That means that the rest of the attributes should be interpreted using the CF v1.6 specification found on http://cfconventions.org/. The real meat and potatoes of a NetCDF file are the "dimensions" and "variables". All useful NetCDF files will have dimensions and variables.
Dimensions have a name and a size. We can see from the above that this file has three dimensions (time, rows, cols). The time dimension has a size of 1, while rows and cols both have a size of 180. Dimensions in and of themselves are not very interesting. They exist mainly to specify the shape of gridded data stored in Variables.
Variables have a name, a data type, and zero or more dimensions. If a variable has zero dimensions (like "coord_system", above), it is a scalar and can only store one value. Otherwise, the dimensions specify the shape of the grid. The time variable has one dimension, also named time. Latitude and longitude have two dimensions (rows and cols), and everything else except "coord_system" has three dimensions (time, rows, cols).
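Rather than eyeballing the printout, we can fetch the same pieces programmatically. Here is a minimal sketch using the standard netCDF4-python API (exact printout details vary with the library version):
In [ ]:
print "Conventions:", d.Conventions
for name, dim in d.dimensions.items():
    print "dimension:", name, "size:", len(dim)
for name, var in d.variables.items():
    print "variable:", name, var.dtype, var.dimensions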
Starting simple, let's look at the "time" variable.
In [8]:
print d.variables['time']
Printing an individual variable gives quite a bit more information about the variable than the file overview did. This information tends to be directly related to the variable itself. We can see the variable is a 32-bit integer. Since its only dimension is time, and the time dimension has a size of 1, we expect current_shape to be a single dimension holding one value. The rest of the attributes are human-friendly: you can pretty much figure them out just by looking.
Let's check if the contents of the time variable meet our expectations. From the filename, this file contains data for the week of Jan 7-13, 2003. Without being too careful about leap years and such, let's check whether the time variable contains a number that works out to be about the number of days that would put us in Jan 2003 if we started counting on Oct 3, 1966.
In [ ]:
print d.variables['time'][0]
Here for the first time, we've asked for actual data from the file. To get data, you index the variable of interest. In the above, we asked for the first value in the time array (index 0).
Now we just have to see if 13251 days puts us in 2003.
In [ ]:
full_years = 365.25 * (2002-1966)
oct_to_jan = np.sum([(31-3),30,31]) # rest of oct, nov, dec
print full_years, full_years + oct_to_jan
print d.variables['time'][0] - (full_years + oct_to_jan)
Our faith in the ability of computers to count is affirmed. All hail our cyber overlords. The day specified is Jan 13, 2003 plus or minus 1 day for alignment of leap years with our time window.
The attributes "units", "standard_name", and "axis" are part of the CF 1.6 convention that the file respects. And since this variable is in the file adhering to CF 1.6, the variable adheres to it too. These attributes have special meanings, and you have to read the convention specification to find out what that meaning is.
One more term to remember is "coordinate variable". A one-dimensional variable that has the same name as its dimension is a coordinate variable. Hence, the time variable is the only coordinate variable in this file. Coordinate variables are how you translate from unitless grid indices to real values along the axis. In the above, index 0 for the time dimension stands for Jan 13, 2003. This relationship holds for any variable in this file which uses the time dimension.
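Incidentally, the netCDF4 library can decode CF time units for us, so we don't have to count days by hand. A quick sketch using netCDF4's num2date (assuming the units string in the file is CF-compliant, which it is here):
In [ ]:
t = d.variables['time']
print nc.num2date(t[0], t.units)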
Sometimes, the grid cells are not evenly spaced. The latitude and longitude variables for this EASE Grid 2.0 are examples of this. The data are spaced at even 100km intervals in the geospatial projection, which does not translate to nicely gridded lat/lon data. Auxiliary coordinate variables are 2D variables which can be used to lookup the latitude and longitude of any row/col pair.
In [ ]:
print d.variables['latitude']
This variable stores the latitude of each grid cell. Software recognizes it as an auxiliary coordinate variable containing latitudes from the "degrees_north" units and/or the "latitude" value of its standard_name attribute. The name of the variable is not special. It could be called 'i_do_not_contain_latitudes', and as long as one of those special attributes contained the magic latitude-indicating values, it would still be interpreted as latitudes.
Let's see what's inside.
In [ ]:
print d.variables['latitude'][0,0]
Because this is a 2D variable, we had to provide a 2D index. The -- means that the latitude value at index [0,0] is nodata. The EASE Grid is a polar projection which looks more or less like a circle. The corners don't have data, so we really should have expected this. While we're here, let's load the whole latitude array into memory because it's small.
In [9]:
lats = d.variables['latitude'][:]
print lats.shape
This gives us a new variable called "lats". It holds a copy of the values that were in the file, all 180x180 of them. Changing the values in "lats" does not change the values in the file itself. We used a technique called "slicing" (a colon instead of an index) to load the values we wanted. Normally you say [min_index:max_index] to select a subset, where max_index itself is excluded. min_index defaults to the beginning of the array and max_index defaults to the end. In this case, we wanted everything, so we left out both, letting them take on their default values.
There's actually a little bit more than this going on since we're talking about a 2D array, but that is out of scope for this notebook. Slicing is a powerful python technique which you need to spend some brain cells on. Go forth and learn.
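To make the semantics a little more concrete, here are a few throwaway slices against the in-memory lats array (expected shapes in the comments):
In [ ]:
print lats[0:5, 0:5].shape    # the 5x5 corner: (5, 5)
print lats[::30, ::30].shape  # every 30th row and column: (6, 6)
print lats[90, :].shape       # one full row: (180,)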
In this case, the Python variable "lats" is what's known as a masked array. The netCDF4 library automatically masked out the nodata values for us, which is why we saw a '--' above when we printed the latitude value of the corner. Masked arrays are a combination of a boolean (True/False) mask and the values. If the mask flag for a cell is True, the cell is considered to contain "nodata". So let's revisit that masked-out corner.
In [10]:
print "The mask is: ", lats.mask[0,0]
print "The data value is: ", lats.data[0,0]
If you scroll back up, the fill value in the netCDF file was -999. So you can see that python is making things a little nicer for you in that you don't have to always check to see if the grid cell contains the fill value/nodata value.
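If you ever need a plain array with the fill value put back in, masked arrays can do that too. A quick sketch with filled(), assuming (as is normal for netCDF4) that the file's _FillValue was copied into the array's fill_value:
In [ ]:
plain = lats.filled()   # masked cells replaced by lats.fill_value
print lats.fill_value, plain[0,0]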
Let's set the corner value to something and then mask it back out.
In [11]:
lats[0,0] = 42.
print lats[0,0]
print "The mask is: ", lats.mask[0,0]
print "The data value is: ", lats.data[0,0]
In [12]:
# ... and set it back
lats[0,0] = np.ma.masked
print lats[0,0]
print "The mask is: ", lats.mask[0,0]
print "The data value is: ", lats.data[0,0]
Whoops! It masked the value out for us all right, but it didn't change the value back to -999. Keep this in mind. Depending on your situation, it might be important. You can explicitly set the value by using the "data" attribute.
In [13]:
lats.data[0,0] = -999.0
print lats[0,0]
print "The mask is: ", lats.mask[0,0]
print "The data value is: ", lats.data[0,0]
So far we have explored a whole lot of nothing useful. We know where each grid cell is, but not whether it is covered in snow. That is what we actually want to know, so we turn to "merged_snow_cover_extent", which is neither a coordinate variable nor an auxiliary coordinate variable. We will create a Python variable as a shorthand for the data in the file, but we're not going to give a slicing expression. Without the slicing expression, Python won't copy the data into memory from the file.
In [14]:
snow = d.variables['merged_snow_cover_extent']
print snow
See how the above behaved as if we typed "print d.variables['merged_snow_cover_extent']"? Now we can be lazy and just type "print snow" instead. You can still specify an index and get the data, but this time you are not working with a copy that resides in memory. Remember, we did not explicitly ask Python to copy all the data out of the file into a variable. Ergo, if you try to assign different values to cells in the "snow" variable, it will try to change the values in the file. In this case, we opened the file read-only (which is the default), so changing a value would fail.
In [15]:
print snow[0,0,0]
??! A 3D variable? Tricky. You can handle this by now, and you should be able to tell what order the coordinates go in by looking above.
So what happens if we try to change the values in the snow variable?
In [ ]:
# uncommenting this raises an error, because the file is open read-only
#snow[0,0,0] = 42
This translates roughly as "Thou shalt not try to write to a file you opened as read only." Fear not the crappy error message. Just understand that when we created the "snow" variable in python, we essentially told python to treat the snow variable as a handle to the data on disk. Trying to set values in the snow variable attempts to write the new values to the file...which it can't do if the file has been opened for reading only.
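If you want to demonstrate the failure without crashing the notebook, wrap the assignment in try/except. The exact exception class differs between netCDF4 versions, so this sketch catches broadly:
In [ ]:
try:
    snow[0,0,0] = 42
except Exception as e:
    print "Write failed, as expected:", e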
Now, one final thing: the coordinates attribute lists the (auxiliary) coordinate variables which contain the location information in each cell. The CF 1.6 standard requires that the lat/lon of each cell be provided if the lat/lons are not evenly spaced out.
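The coordinates attribute is just a space-separated string of variable names, so we can follow it mechanically; a small sketch:
In [ ]:
for name in snow.coordinates.split():
    print name, d.variables[name].dimensions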
If you look above where we printed out the snow variable, you should see a "grid_mapping" attribute. This is another one of those attributes specified by CF-1.6. It contains the name of a dummy variable which describes the projection in which the data are presented. It's a dummy variable because it contains no data values; it serves only as a container for a bunch of descriptive attributes, all of which are described by CF-1.6. The snow variable declares that its projection information is contained in the "coord_system" dummy variable, so let's take a look.
In [16]:
coords = d.variables['coord_system']
print coords
The only attribute that a variable pointed to by a "grid_mapping" attribute must have is "grid_mapping_name". These names select the projection and are defined in http://cfconventions.org/Data/cf-conventions/cf-conventions-1.6/build/cf-conventions.html#appendix-grid-mappings. The other attributes define the parameters of the projection, which of course vary from projection to projection. The parameters of the spheroid may appear regardless of the projection selected.
This gives us enough information to calculate projected coordinates from lat/lon, or vice versa. You will want to use pre-written library code for that.
Here is what we've discovered so far: the file declares CF-1.6; it has a time coordinate variable, 2D auxiliary latitude/longitude variables, 180x180 data grids, and a grid_mapping dummy variable describing a Lambert Azimuthal Equal Area projection.
The big thing that is missing is a simple translation between rows/cols and locations in the projected coordinate system. Presumably, once we provide this, other programs will start to grok the NetCDF file, or we can export to a different format like GeoTIFF. Specifically, the most likely reason the projection is not being read is that the file technically violates the CF 1.6 specification. For the Lambert Azimuthal Equal Area projection, the spec says "The x (abscissa) and y (ordinate) rectangular coordinates are identified by the standard_name attribute values projection_x_coordinate and projection_y_coordinate respectively."
So, software designed to the specification expects coordinate variables for rows and cols, tagged with the standard_name values "projection_x_coordinate" and "projection_y_coordinate". As noted above, these variables are entirely missing. If we provide them, the software may start working.
We are going to assume that the relationship between grid cell indices and projected coordinates is linear, that the axes are orthogonal, and that the (i,j) grid axes are not rotated with respect to the axes of the projected coordinate system. We will only generalize to a more complex model if these assumptions turn out to be false.
We need to find two points that have lat/lon data; they should be as far apart as possible and must not share a row or a column.
In [19]:
unmasked_ij = np.where(np.logical_not(lats.mask))
indices = zip(*unmasked_ij)
first = indices[0]
last = indices[-1]
print "First: ", first
print "Last: ", last
In [20]:
lons = d.variables['longitude']
In [21]:
print "First: ", lats[first], lons[first]
print "Last: ", lats[last], lons[last]
If we were trying to fit rows/cols directly to lat/lon, this would be a problem. But we already know a simple linear fit to lat/lon isn't possible, so let's take these lat/lons and calculate the projected coordinates instead. We will be using Proj.4 (http://trac.osgeo.org/proj/). The first step is to get the parameter values for the projection from the coord_system variable.
Up to now we've only printed attribute values and looked at them; this is the first time we'll have the computer fetch specific values and use them automatically. Getting the value of an attribute in the netCDF file works exactly as if it were an attribute of a Python variable:
In [22]:
#remember that coords is the python variable referring to the "coord_system" netcdf variable.
lat_0 = coords.latitude_of_projection_origin
lon_0 = coords.longitude_of_projection_origin
false_e = coords.false_easting
false_n = coords.false_northing
s_major = coords.semimajor_axis
s_minor = coords.semiminor_axis
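One caveat: a grid_mapping variable is not required to carry every parameter, and a missing attribute raises AttributeError. A defensive sketch uses getattr with a default; the zero defaults for false easting/northing below are my assumption, so check the CF text for the projection you actually have:
In [ ]:
false_e = getattr(coords, 'false_easting', 0.0)
false_n = getattr(coords, 'false_northing', 0.0)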
Now we make a projection object. It's just a handy way to carry around a description of which projection equations to use, and what the parameters are.
In [23]:
the_proj = pyproj.Proj(proj="laea", lat_0=lat_0, lon_0=lon_0, x_0=false_e, y_0=false_n, a=s_major, b=s_minor)
geo = pyproj.Proj(proj="latlong", a=s_major, b=s_minor)
Now we can compute the projected coordinates for the lat/lon coordinates of the two points.
In [24]:
x1,y1 = pyproj.transform(geo,the_proj, lons[first], lats[first])
x2,y2 = pyproj.transform(geo,the_proj, lons[last], lats[last])
In [25]:
print x1,y1
print x2,y2
So, it's now a matter of making a linear fit for x and a linear fit for y.
In [26]:
# it's (rows,cols), so index[1]==column
m_x = (x2-x1)/(last[1]-first[1])
b_x = x1 - (m_x*first[1])
print x2, last[1]*m_x+b_x
Woo hoo, the fit for x seems to work. Now on to y.
In [27]:
# it's (rows,cols) so index[0]==rows
m_y = (y2-y1)/(last[0]-first[0])
b_y = y1 -(m_y*first[0])
print y2, last[0]*m_y+b_y
Just to double-check: we know the data set is on a 100km grid, so m_x and m_y should both have a magnitude of 100000 meters.
In [28]:
print m_x, m_y
Close enough for me.
Python, specifically the "numpy" library, makes it easy to write equations which manipulate entire arrays of data. Usually, the equation is applied to each element of the input array and produces a corresponding output array. Numpy is well worth learning.
In the following, the function "np.arange" generates an array containing the numbers 0, 1, 2, ..., n-1, where n is the number of rows or cols. This entire array is transformed with the linear fit we produced above, yielding the projected x/y coordinates for each row and column.
In [29]:
cv_rows = m_y * np.arange(len(d.dimensions['rows'])) + b_y  # rows map to projected y
cv_cols = m_x * np.arange(len(d.dimensions['cols'])) + b_x  # cols map to projected x
Now we need to store these new coordinate variables in the file and add the expected attributes. To do so, we need to close the netCDF file and open it again for writing. Make sure to open for read/write ('r+')! Opening just for write ('w') will clobber the existing file, leaving you a nice clean slate.
In [30]:
print cv_rows
print cv_cols
In [31]:
# first, close the current file
d.close()
# now open again for writing
d = nc.Dataset('nhtsw100e2_20030107_20030113.nc', 'r+')
Creating variables in netCDF is pretty simple. You just need to specify the variable name, the data type, and a list of dimension names as strings. Coordinate variables are super simple because the variable name is the same as the name of the only dimension they have.
In [33]:
# make the rows coordinate variable
# note the variable name is the same as the dimension name
rows_var = d.createVariable('rows',np.float32, ('rows',))
# give it the expected attributes
rows_var.standard_name = 'projection_y_coordinate'
rows_var.axis = 'Y'
rows_var.units = 'meters'
# write the values to the variable
rows_var[:] = cv_rows
In [34]:
# make the cols coordinate variable
# note the variable name is the same as the dimension name
cols_var = d.createVariable('cols',np.float32, ('cols',))
# give it the expected attributes
cols_var.standard_name = 'projection_x_coordinate'
cols_var.axis = 'X'
cols_var.units = 'meters'
# write the values to the variable
cols_var[:] = cv_cols
In [35]:
#close the file to make sure everything is written
d.close()
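Before declaring victory, a quick sanity check from the Python side; this sketch just reopens the file read-only and peeks at the new coordinate variables:
In [ ]:
d = nc.Dataset('nhtsw100e2_20030107_20030113.nc')
print d.variables['rows'][:3], d.variables['rows'].standard_name
print d.variables['cols'][:3], d.variables['cols'].standard_name
d.close()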
Poof. GDAL works now:
[fedora@lugosi ~]$ gdalinfo NETCDF:nhtsw100e2_20030107_20030113.nc:merged_snow_cover_extent
Driver: netCDF/Network Common Data Format
Files: nhtsw100e2_20030107_20030113.nc
Size is 180, 180
Coordinate System is:
PROJCS["LAEA (WGS84) ",
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
TOWGS84[0,0,0,0,0,0,0],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0,
AUTHORITY["EPSG","8901"]],
UNIT["degree",0.0174532925199433,
AUTHORITY["EPSG","9108"]],
AUTHORITY["EPSG","4326"]],
PROJECTION["Lambert_Azimuthal_Equal_Area"],
PARAMETER["latitude_of_center",90],
PARAMETER["longitude_of_center",0],
PARAMETER["false_easting",0],
PARAMETER["false_northing",0]]
Because gdal is now detecting the spatial reference system, one can use it to convert the variables to geotiff. But we are not quite done. Remember the slope of the line fit between "rows" and the projected y coordinate? It was negative, right?
In [ ]:
print m_y
Yup. Negative. That means row zero translates to the biggest projected Y value and row 179 translates to the smallest, so row 0 starts at the top and works its way down. The netcdf driver for gdal tries to detect whether the rows are top-down or bottom-up, but sometimes it gets it wrong. Our data set is one of the lucky winners.
Configuration options from http://www.gdal.org/frmt_netcdf.html:
GDAL_NETCDF_BOTTOMUP=[YES/NO] : Set the y-axis order for import, overriding the order detected by the driver. This option is usually not needed unless a specific dataset is causing problems (which should be reported in GDAL trac).
In this case, gdal 1.11.1 autodetects a bottom-up dataset when it should be top down, so we must override. The working conversion command for gdal_translate is as follows:
gdal_translate -of GTiff -b 1 --config GDAL_NETCDF_BOTTOMUP NO \
    NETCDF:nhtsw100e2_20030107_20030113.nc:merged_snow_cover_extent snowcover.tif
Here we view the data in the file. First, we don't do anything special, just dump it to the screen exactly as it exists in the file.
In [ ]:
d = nc.Dataset('nhtsw100e2_20030107_20030113.nc')
snow = d.variables['merged_snow_cover_extent'][:]
lats = d.variables['latitude'][:]
lons = d.variables['longitude'][:]
plt.imshow(snow[0,...])
That's great, but we don't really have any indication that the georeferencing worked. The axes are strictly (i,j) without any physical meaning. In fact, we could have done this right off the bat, without fixing anything. What we really want to do is draw a quick map with crude coastlines and some lat/lon lines.
In [ ]:
def init_basemap():
m = Basemap(width=100000*180, height=100000*180,
#llcrnrx=cv_cols[0], llcrnry=cv_rows[-1], urcrnrx=cv_cols[-1], urcrnry=cv_rows[0],
resolution='l', projection='laea',
lon_0=lon_0, lat_0=lat_0)
m.drawcoastlines()
m.drawcountries()
m.drawmeridians(np.arange(-180.,180.,20.),labels=[False,False,False,True])
m.drawparallels(np.arange(10.,80.,20.), labels=[True,False,False,False])
return m
The basic strategy is to use the map template above by calling "init_basemap", then draw the data on it. The data are drawn using "filled contours" between the data points. This may or may not be what we want.
In [ ]:
plt.figure(figsize=(12,10))
m = init_basemap()
c = m.contourf(lons[:], lats[:],
snow[0,...], 20, latlon=True)
cb = m.colorbar(c)
title("Snow cover, 13 Jan 2003")
As shown, the data values line up with the coastlines, so it appears that the georeferencing worked just fine.
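If the interpolation implied by filled contours isn't what you want, a cell-by-cell rendering is a small change. A sketch using Basemap's pcolormesh, which accepts the same latlon=True convention as contourf:
In [ ]:
plt.figure(figsize=(12,10))
m = init_basemap()
c = m.pcolormesh(lons[:], lats[:], snow[0,...], latlon=True)
cb = m.colorbar(c)
plt.title("Snow cover, 13 Jan 2003 (pcolormesh)")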