syncID: 3857005e98a544a88a5e58625e32b514
title: "Introduction to working with NEON eddy flux data"
description: "Download and navigate NEON eddy flux data, including basic transformations and merges"
dateCreated: 2019-07-09
authors: Claire K. Lunch
contributors: 
estimatedTime: 1 hour
packagesLibraries: rhdf5, neonUtilities, ggplot2
topics: HDF5, eddy-covariance, eddy-flux
languagesTool: R
dataProduct: DP4.00200.001
code1: /R/eddy-intro/eddy_intro.r
tutorialSeries: 
urlTitle: eddy-data-intro

This data tutorial provides an introduction to working with NEON eddy flux data, using the neonUtilities R package. If you are new to NEON data, we recommend starting with a more general tutorial, such as the neonUtilities tutorial or the Download and Explore tutorial. Some of the functions and techniques described in those tutorials will be used here, as well as functions and data formats that are unique to the eddy flux system.

This tutorial assumes general familiarity with eddy flux data and associated concepts.

1. Setup

Start by installing and loading packages and setting options. To work with the NEON flux data, we need the rhdf5 package, which is hosted on Bioconductor and requires a different installation process than CRAN packages:


In [ ]:
install.packages('BiocManager')
BiocManager::install('rhdf5')
install.packages('neonUtilities')

In [1]:
options(stringsAsFactors=F)

library(neonUtilities)

Use the zipsByProduct() function from the neonUtilities package to download flux data from two sites and two months. The transformations and functions below will work on any time range and site(s), but limiting to two sites and two months lets us see all the available functionality while minimizing download size.

Inputs to the zipsByProduct() function:

  • dpID: DP4.00200.001, the bundled eddy covariance product
  • package: basic (the expanded package is not covered in this tutorial)
  • site: NIWO = Niwot Ridge and HARV = Harvard Forest
  • startdate: 2018-06
  • enddate: 2018-07 (start and end dates are both inclusive)
  • savepath: modify this to something logical on your machine
  • check.size: T if you want to see file size before downloading, otherwise F

The download may take a while, especially if you're on a slow network.


In [2]:
zipsByProduct(dpID="DP4.00200.001", package="basic", 
              site=c("NIWO", "HARV"), 
              startdate="2018-06", enddate="2018-07",
              savepath="/data", 
              check.size=F)


Downloading files totaling approximately 313.289221 MB
Downloading 4 files
  |======================================================================| 100%
4 files downloaded to /data/filesToStack00200

2. Data Levels

There are five levels of data contained in the eddy flux bundle. For full details, refer to the NEON algorithm document.

Briefly, the data levels are:

  • Level 0' (dp0p): Calibrated raw observations
  • Level 1 (dp01): Time-aggregated observations, e.g. 30-minute mean gas concentrations
  • Level 2 (dp02): Time-interpolated data, e.g. rate of change of a gas concentration
  • Level 3 (dp03): Spatially interpolated data, i.e. vertical profiles
  • Level 4 (dp04): Fluxes

The dp0p data are available in the expanded data package and are beyond the scope of this tutorial.

The dp02 and dp03 data are used in storage calculations, and the dp04 data include both the storage and turbulent components. Since many users will want to focus on the net flux data, we'll start there.

3. Extract Level 4 data (Fluxes!)

To extract the Level 4 data from the HDF5 files and merge them into a single table, we'll use the stackEddy() function from the neonUtilities package.

stackEddy() requires two inputs:

  • filepath: Path to a file or folder, which can be any one of:
    1. A zip file of eddy flux data downloaded from the NEON data portal
    2. A folder of eddy flux data downloaded by the zipsByProduct() function
    3. The folder of files resulting from unzipping either of 1 or 2
    4. A single HDF5 file of NEON eddy flux data
  • level: the data level to extract, dp01 through dp04

Input the filepath you downloaded to using zipsByProduct() earlier, including the filesToStack00200 folder created by the function, and dp04:


In [3]:
flux <- stackEddy(filepath="/data/filesToStack00200/",
                 level="dp04")


Extracting data
  |======================================================================| 100%
Stacking data tables by month
  |======================================================================| 100%
Joining data variables
  |======================================================================| 100%

We now have an object called flux. It's a named list containing four tables: one data table for each site, plus the variables and objDesc metadata tables.


In [4]:
names(flux)


  1. 'HARV'
  2. 'NIWO'
  3. 'variables'
  4. 'objDesc'

Let's look at the contents of one of the site data files:


In [5]:
head(flux$NIWO)


A data.frame: 6 × 26 (first columns shown)
  timeBgn                  timeEnd                  data.fluxCo2.nsae.flux data.fluxCo2.stor.flux data.fluxCo2.turb.flux data.fluxH2o.nsae.flux data.fluxH2o.stor.flux data.fluxH2o.turb.flux ...
1 2018-06-01T00:00:00.000Z 2018-06-01T00:29:59.000Z 0.1111935              -0.06191186            0.1731053              19.401824               3.2511265             16.150697 ...
2 2018-06-01T00:30:00.000Z 2018-06-01T00:59:59.000Z 0.9328922               0.08534117            0.8475510              10.444936              -1.1768333             11.621770 ...
3 2018-06-01T01:00:00.000Z 2018-06-01T01:29:59.000Z 0.4673682               0.02177216            0.4455960               5.140617              -4.3112673              9.451884 ...
4 2018-06-01T01:30:00.000Z 2018-06-01T01:59:59.000Z 0.7263614               0.24944366            0.4769178               9.017467               0.1980776              8.819389 ...
5 2018-06-01T02:00:00.000Z 2018-06-01T02:29:59.000Z 0.4740572               0.22524363            0.2488136               3.180386               0.1316297              3.048756 ...
6 2018-06-01T02:30:00.000Z 2018-06-01T02:59:59.000Z 0.8807022               0.07078007            0.8099221               4.398761              -0.2989443              4.697706 ...

The remaining columns include the momentum and temperature fluxes (data.fluxMome, data.fluxTemp), the footprint statistics (data.foot.stat), and the final quality flags for the storage fluxes (qfqm.*.stor.qfFinl).

The variables and objDesc tables can help you interpret the column headers in the data table. The objDesc table contains definitions for many of the terms used in the eddy flux data product, but it isn't complete. To get the terms of interest, we'll break up the column headers into individual terms and look for them in the objDesc table:


In [6]:
term <- unlist(strsplit(names(flux$NIWO), split=".", fixed=T))
flux$objDesc[which(flux$objDesc$Object %in% term),]


A data.frame: 6 × 2
    Object      Description
138 angZaxsErth Wind direction
171 data        Represents data fields
343 qfFinl      The final quality flag indicating if the data are valid for the given aggregation period (1=fail, 0=pass)
420 qfqm        Quality flag and quality metrics, represents quality flags and quality metrics that accompany the provided data
604 timeBgn     The beginning time of the aggregation period
605 timeEnd     The end time of the aggregation period

Some terms of interest aren't captured in objDesc; of these, fluxCo2, fluxH2o, and fluxTemp are self-explanatory. The flux components are:

  • turb: Turbulent flux
  • stor: Storage
  • nsae: Net surface-atmosphere exchange, the sum of the storage and turbulent terms (see the check below)
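
Since the net surface-atmosphere exchange is defined as the sum of the storage and turbulent components, we can sanity-check the extracted columns. A minimal sketch, using the NIWO table and column names from above:

# confirm nsae = stor + turb for the CO2 flux, within floating-point tolerance
all.equal(flux$NIWO$data.fluxCo2.nsae.flux,
          flux$NIWO$data.fluxCo2.stor.flux + 
          flux$NIWO$data.fluxCo2.turb.flux)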

The variables table contains the units for each field:


In [7]:
flux$variables


A data.frame: 24 × 5
   category system   variable stat             units
1  data     fluxCo2  nsae                      umolCo2 m-2 s-1
2  data     fluxCo2  stor                      umolCo2 m-2 s-1
3  data     fluxCo2  turb                      umolCo2 m-2 s-1
4  data     fluxH2o  nsae                      W m-2
5  data     fluxH2o  stor                      W m-2
6  data     fluxH2o  turb                      W m-2
7  data     fluxMome turb                      m s-1
8  data     fluxTemp nsae                      W m-2
9  data     fluxTemp stor                      W m-2
10 data     fluxTemp turb                      W m-2
11 data     foot     stat     angZaxsErth      deg
12 data     foot     stat     distReso         m
13 data     foot     stat     veloYaxsHorSd    m s-1
14 data     foot     stat     veloZaxsHorSd    m s-1
15 data     foot     stat     veloFric         m s-1
16 data     foot     stat     distZaxsMeasDisp m
17 data     foot     stat     distZaxsRgh      m
18 data     foot     stat     distZaxsAbl      m
19 data     foot     stat     distXaxs90       m
20 data     foot     stat     distXaxsMax      m
21 data     foot     stat     distYaxs90       m
22 qfqm     fluxCo2  stor                      NA
23 qfqm     fluxH2o  stor                      NA
24 qfqm     fluxTemp stor                      NA

Let's plot some data! First, we'll need to convert the time stamps to an R date-time format (right now they're just character fields).

Time stamps

NEON sensor data come with time stamps for both the start and end of the averaging period. Depending on the analysis you're doing, you may want to use one or the other; for general plotting, re-formatting, and transformations, I prefer to use the start time, because there are some small inconsistencies between data products in a few of the end time stamps.

Note that all NEON data use UTC time, noted as tz="GMT" in the code below. This is true across NEON's instrumented, observational, and airborne measurements. When working with NEON data, it's best to keep everything in UTC as much as possible; otherwise it's very easy to end up with mismatched time stamps, which can cause insidious, hard-to-detect problems. Be sure to include the tz argument in all the lines of code below - if no time zone is specified, R will default to the local time zone it detects on your operating system.


In [8]:
timeB <- as.POSIXct(flux$NIWO$timeBgn, 
                    format="%Y-%m-%dT%H:%M:%S", 
                    tz="GMT")
flux$NIWO <- cbind(timeB, flux$NIWO)

In [9]:
plot(flux$NIWO$data.fluxCo2.nsae.flux~timeB, 
     pch=".", xlab="Date", ylab="CO2 flux",
     xaxt="n")
axis.POSIXct(1, x=timeB, format="%Y-%m-%d")


Like a lot of flux data, these data have some stray spikes, but there is a clear diurnal pattern going into the growing season.

Let's trim down to just two days of data to see a few other details.


In [10]:
plot(flux$NIWO$data.fluxCo2.nsae.flux~timeB, 
     pch=20, xlab="Date", ylab="CO2 flux",
     xlim=c(as.POSIXct("2018-07-07", tz="GMT"),
            as.POSIXct("2018-07-09", tz="GMT")),
     ylim=c(-20,20), xaxt="n")
axis.POSIXct(1, x=timeB, format="%Y-%m-%d %H:%M:%S")


Note the timing of C uptake; the UTC time zone is obvious here, since uptake appears to occur during the night. In local time, those are the daytime hours.
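
To confirm this, you can display a time stamp in local time without altering the underlying data. A quick sketch; "America/Denver" is the Olson time zone name for Niwot Ridge's US Mountain zone:

# display a UTC time stamp in Mountain time, for orientation only
format(as.POSIXct("2018-07-08 06:00:00", tz="GMT"), 
       tz="America/Denver", usetz=TRUE)
# returns "2018-07-08 00:00:00 MDT"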

4. Merge flux data with other sensor data

Many of the data sets we would use to interpret and model flux data are measured as part of the NEON project, but are not present in the eddy flux data product bundle. In this section, we'll download PAR data and merge them with the flux data; the steps taken here can be applied to any of the NEON instrumented (IS) data products.

Download PAR data

To get NEON PAR data, use the loadByProduct() function from the neonUtilities package. loadByProduct() takes the same inputs as zipsByProduct(), but it loads the downloaded data directly into the current R environment.

Let's download PAR data matching the Niwot Ridge flux data. The inputs needed are:

  • dpID: DP1.00024.001
  • site: NIWO
  • startdate: 2018-06
  • enddate: 2018-07
  • package: basic
  • avg: 30

The new input here is avg=30, which downloads only the 30-minute data. Since the flux data are at a 30-minute resolution, we can save on download time by disregarding the 1-minute data files (which are of course 30 times larger). The avg input can be left off if you want to download all available averaging intervals.


In [11]:
pr <- loadByProduct("DP1.00024.001", site="NIWO", avg=30,
                    startdate="2018-06", enddate="2018-07",
                    package="basic", check.size=F)


Downloading files totaling approximately 1.281789 MB
Downloading 11 files
  |======================================================================| 100%

Stacking operation across a single core.
Stacking table PARPAR_30min
Merged the most recent publication of sensor position files for each site and saved to /stackedFiles
Copied the most recent publication of variable definition file to /stackedFiles
Finished: Stacked 1 data tables and 2 metadata tables!
Stacking took 0.2982938 secs

pr is another named list, and again, metadata and units can be found in the variables table. The PARPAR_30min table contains a verticalPosition field, which indicates the position on the tower, with 10 being the first tower level and 20, 30, etc. going up the tower.
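
To see which tower levels are available at a site before subsetting, a minimal sketch:

# list the tower levels present in the 30-minute PAR data
unique(pr$PARPAR_30min$verticalPosition)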

Join PAR to flux data

We'll connect PAR data from the tower top to the flux data.


In [12]:
pr.top <- pr$PARPAR_30min[which(pr$PARPAR_30min$verticalPosition==
                                max(pr$PARPAR_30min$verticalPosition)),]

loadByProduct() automatically converts time stamps when it reads the data, so here we just need to indicate which time field to use to merge the flux and PAR data.


In [13]:
timeB <- pr.top$startDateTime
pr.top <- cbind(timeB, pr.top)

And merge the two datasets:


In [14]:
fx.pr <- merge(pr.top, flux$NIWO, by="timeB")

In [15]:
plot(fx.pr$data.fluxCo2.nsae.flux~fx.pr$PARMean,
     pch=".", ylim=c(-20,20),
     xlab="PAR", ylab="CO2 flux")


If you're interested in data in the eddy covariance bundle besides the net flux data, the rest of this tutorial will guide you through how to get those data out of the bundle.

5. Vertical profile data (Level 3)

The Level 3 (dp03) data are the spatially interpolated profiles of the rates of change of CO2, H2O, and temperature. Extract the Level 3 data from the HDF5 file using stackEddy() with the same syntax as for the Level 4 data.


In [16]:
prof <- stackEddy(filepath="/data/filesToStack00200/",
                 level="dp03")


Extracting data
  |======================================================================| 100%
Stacking data tables by month
  |======================================================================| 100%
Joining data variables
  |======================================================================| 100%

In [17]:
head(prof$NIWO)


A data.frame: 6 × 506

The columns are timeBgn and timeEnd, followed by the interpolated rate-of-change profiles in 0.1 m height increments (e.g. data.co2Stor.rateRtioMoleDryCo2.0.1 m) and the corresponding quality flags (e.g. qfqm.tempStor.rateTemp.8.4 m). First columns:

  timeBgn                  timeEnd                  data.co2Stor.rateRtioMoleDryCo2.0.1 m data.co2Stor.rateRtioMoleDryCo2.0.2 m ...
1 2018-06-01T00:00:00.000Z 2018-06-01T00:29:59.000Z -0.0002681938                         -0.0002681938 ...
2 2018-06-01T00:30:00.000Z 2018-06-01T00:59:59.000Z  0.0004878799                          0.0004878799 ...
3 2018-06-01T01:00:00.000Z 2018-06-01T01:29:59.000Z  0.0005085725                          0.0005085725 ...
4 2018-06-01T01:30:00.000Z 2018-06-01T01:59:59.000Z  0.0013276966                          0.0013276966 ...
5 2018-06-01T02:00:00.000Z 2018-06-01T02:29:59.000Z  0.0007344040                          0.0007344040 ...
6 2018-06-01T02:30:00.000Z 2018-06-01T02:59:59.000Z -0.0009449785                         -0.0009449785 ...
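
Because the dp03 column names embed the interpolated heights (including a space, e.g. data.co2Stor.rateRtioMoleDryCo2.0.1 m), selecting columns by pattern is often easier than typing names. A minimal sketch pulling out one CO2 profile:

# indices of the interpolated CO2 rate-of-change columns (data, not qfqm)
co2.cols <- grep("data.co2Stor.rateRtioMoleDryCo2", 
                 names(prof$NIWO), fixed=TRUE)
# the vertical profile from the first averaging period, as a numeric vector
co2.prof <- unlist(prof$NIWO[1, co2.cols], use.names=FALSE)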

6. Un-interpolated vertical profile data (Level 2)

The Level 2 data are interpolated in time but not in space. They contain the rates of change at the measurement heights.

Again, they can be extracted from the HDF5 files using stackEddy() with the same syntax:


In [18]:
prof.l2 <- stackEddy(filepath="/data/filesToStack00200/",
                 level="dp02")


Extracting data
  |======================================================================| 100%
Stacking data tables by month
  |======================================================================| 100%
Joining data variables
  |======================================================================| 100%

In [19]:
head(prof.l2$HARV)


A data.frame: 6 × 9
  verticalPosition timeBgn                  timeEnd                  data.co2Stor.rateRtioMoleDryCo2.mean data.h2oStor.rateRtioMoleDryH2o.mean data.tempStor.rateTemp.mean qfqm.co2Stor.rateRtioMoleDryCo2.qfFinl qfqm.h2oStor.rateRtioMoleDryH2o.qfFinl qfqm.tempStor.rateTemp.qfFinl
1 010              2018-06-01T00:00:00.000Z 2018-06-01T00:29:59.000Z          NaN                         NaN                                   2.583333e-05               1 1 0
2 010              2018-06-01T00:30:00.000Z 2018-06-01T00:59:59.000Z  0.002194788                         NaN                                  -2.008056e-04               1 1 0
3 010              2018-06-01T01:00:00.000Z 2018-06-01T01:29:59.000Z -0.010752434                         NaN                                  -1.901111e-04               1 1 0
4 010              2018-06-01T01:30:00.000Z 2018-06-01T01:59:59.000Z  0.002556148                         NaN                                  -7.419444e-05               1 1 0
5 010              2018-06-01T02:00:00.000Z 2018-06-01T02:29:59.000Z -0.015977747                         NaN                                  -1.537083e-04               1 1 0
6 010              2018-06-01T02:30:00.000Z 2018-06-01T02:59:59.000Z -0.000537461                         NaN                                  -1.874861e-04               1 1 0

Note that here, as in the PAR data, there is a verticalPosition field indicating the tower level of the measurement.
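
As an example of working with the dp02 data, here's a sketch of a time series at a single tower level, using the same time-stamp conversion as before (assuming the lowest level, 010, is present, as in the output above):

# subset to the lowest tower level and plot the temperature rate of change
temp.low <- prof.l2$HARV[which(prof.l2$HARV$verticalPosition=="010"),]
timeB.low <- as.POSIXct(temp.low$timeBgn, 
                        format="%Y-%m-%dT%H:%M:%S", tz="GMT")
plot(temp.low$data.tempStor.rateTemp.mean~timeB.low, 
     pch=".", xlab="Date", ylab="Temperature rate of change")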

7. Calibrated raw data (Level 1)

Level 1 (dp01) data are calibrated, and aggregated in time, but otherwise untransformed. Use Level 1 data for raw gas concentrations and atmospheric stable isotopes.

Using stackEddy() to extract Level 1 data requires additional inputs. The Level 1 files are too large to extract all of the variables by default, and they include multiple averaging intervals, which can't be merged. So two additional inputs are needed:

  • avg: The averaging interval to extract
  • var: One or more variables to extract

What variables are available, at what averaging intervals? Another function in the neonUtilities package, getVarsEddy(), returns a list of HDF5 file contents. It requires only one input, a filepath to a single NEON HDF5 file:


In [20]:
vars <- getVarsEddy("/data/filesToStack00200/NEON.D01.HARV.DP4.00200.001.nsae.2018-07.basic.h5")
head(vars)


A data.frame: 6 × 12
   site level category system hor ver tmi name       otype       dclass   dim   oth
5  HARV dp01  data     amrs   000 060 01m angNedXaxs H5I_DATASET COMPOUND 44640 NA
6  HARV dp01  data     amrs   000 060 01m angNedYaxs H5I_DATASET COMPOUND 44640 NA
7  HARV dp01  data     amrs   000 060 01m angNedZaxs H5I_DATASET COMPOUND 44640 NA
9  HARV dp01  data     amrs   000 060 30m angNedXaxs H5I_DATASET COMPOUND 1488  NA
10 HARV dp01  data     amrs   000 060 30m angNedYaxs H5I_DATASET COMPOUND 1488  NA
11 HARV dp01  data     amrs   000 060 30m angNedZaxs H5I_DATASET COMPOUND 1488  NA

Inputs to var can be any values from the name field in the table returned by getVarsEddy(). Let's extract CO2 and H2O concentrations, along with 13C in CO2 and 18O in H2O, at the 30-minute aggregation. We'll use Harvard Forest for these data, since deeper canopies generally have more interesting profiles:


In [21]:
iso <- stackEddy(filepath="/data/filesToStack00200/",
               level="dp01", var=c("rtioMoleDryCo2","rtioMoleDryH2o",
                                   "dlta13CCo2","dlta18OH2o"), avg=30)


Extracting data
  |======================================================================| 100%
Stacking data tables by month
  |======================================================================| 100%
Joining data variables
  |======================================================================| 100%

In [22]:
head(iso$HARV)


A data.frame: 6 × 84 (first columns shown)
  verticalPosition timeBgn                  timeEnd                  data.co2Stor.rtioMoleDryCo2.mean data.co2Stor.rtioMoleDryCo2.min data.co2Stor.rtioMoleDryCo2.max data.co2Stor.rtioMoleDryCo2.vari data.co2Stor.rtioMoleDryCo2.numSamp ...
1 010              2018-06-01T00:00:00.000Z 2018-06-01T00:29:59.000Z 509.3375                         451.4786                        579.3518                        845.0795                         235 ...
2 010              2018-06-01T00:30:00.000Z 2018-06-01T00:59:59.000Z 502.2736                         463.5470                        533.6622                        161.3652                         175 ...
3 010              2018-06-01T01:00:00.000Z 2018-06-01T01:29:59.000Z 521.6139                         442.8649                        563.0518                        547.9924                         235 ...
4 010              2018-06-01T01:30:00.000Z 2018-06-01T01:59:59.000Z 469.6317                         432.6588                        508.7463                        396.8379                         175 ...
5 010              2018-06-01T02:00:00.000Z 2018-06-01T02:29:59.000Z 484.7725                         436.2842                        537.4641                        662.9449                         235 ...
6 010              2018-06-01T02:30:00.000Z 2018-06-01T02:59:59.000Z 476.8554                         443.7055                        515.6598                        246.6969                         175 ...

The remaining columns contain the same statistics for the other sampling subsystems (e.g. co2Turb, isoCo2, isoH2o), plus the ucrt uncertainty fields.

Let's plot vertical profiles of CO2 and 13C in CO2 on a single day.

Here, for convenience, instead of converting the time stamps to a time format, we can use grep() on the character format to extract the day we want. We also discard the verticalPosition values that aren't numeric - those are the calibration gases.


In [23]:
iso.d <- iso$HARV[grep("2018-06-25", iso$HARV$timeBgn, fixed=T),]
iso.d <- iso.d[-which(is.na(as.numeric(iso.d$verticalPosition))),]


Warning message in which(is.na(as.numeric(iso.d$verticalPosition))):
“NAs introduced by coercion”

The coercion warning is expected; it comes from the calibration-gas labels we just removed. ggplot is well suited to these types of data; let's use it to plot the profiles.


In [24]:
library(ggplot2)

In [25]:
g <- ggplot(iso.d, aes(y=verticalPosition)) + 
  geom_path(aes(x=data.co2Stor.rtioMoleDryCo2.mean, 
                group=timeBgn, col=timeBgn)) + 
  theme(legend.position="none") + 
  xlab("CO2") + ylab("Tower level")
g


Warning message:
“Removed 2 rows containing missing values (geom_path).”

In [26]:
g <- ggplot(iso.d, aes(y=verticalPosition)) + 
  geom_path(aes(x=data.isoCo2.dlta13CCo2.mean, 
                group=timeBgn, col=timeBgn)) + 
  theme(legend.position="none") + 
  xlab("d13C") + ylab("Tower level")
g


Warning message:
“Removed 55 rows containing missing values (geom_path).”

The legends are omitted for space; see if you can work out which times of day the different colors represent.

