STREAM FARTS TOOLPACK

Motivation

Rivers, lakes and wetlands are generally known to be net sources of greenhouse gases to the atmosphere, as they regularly emit CO2 and CH4, but much work remains in quantifying the spatial variability and sources of these emissions, especially for large river systems highly impacted by human activities and infrastructure. The motivation behind this collaborative effort is to build an online, open-source tool for synthesizing remotely sensed watershed-related geospatial data with in situ field observations of water chemistry. Our ultimate goal is to automate data ingestion, fusion and exploration in order to understand the patterns and controls driving changes in aquatic greenhouse gases and to help inform regional and global carbon budgets.

Research Objectives/Science Driver

How do CO2 and CH4 concentrations vary spatially along large rivers?
How do carbon gas emission rates vary between two large rivers (the Amazon and the Mississippi)?
How can we visualize these data in a user-friendly interface accessible by researchers?
Can we use other variables to predict these trends in concentration?

Data

Will's dataset (the Amazon, currently geocoded csv)
Kuhn's USGS dataset (Mississippi river)
Third Party Data: Earth Engine imagery

Design & Engineering Tasks

Inputs: Sensor data, shapefiles, Earth Engine datasets

Outputs: Basic maps, scatterplots for relationships between variables, longitudinal spaghetti plots. All plots and maps are automated to consider user-defined options for variable choice and

Functions:
1) Automated sensor data ingestion and cleaning into the Earth Engine platform using Python wrappers
2) Spatial joins of .csv data to other river geospatial datasets
3) Map Visualization of basic concentrations of methane and carbon dioxide
4) Trend Exploration
-scatterplots
-spaghetti plots
5) Predictive Machine Learning to relate other geospatial variables to our gas concentrations
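
As a rough illustration of function 2, the sketch below joins each sensor record to its nearest river feature by great-circle distance. It is not the project's implementation: the field names (lat, lon, id), the sample coordinates, and the 5 km cutoff are all assumptions made for the example, and a real join would more likely use a geospatial library and true reach geometries rather than single points.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_join(records, features, max_dist_m=5000.0):
    """Attach the nearest feature id to each record.

    Records farther than max_dist_m (an assumed cutoff) from every
    feature get feature_id None.
    """
    joined = []
    for rec in records:
        best, best_d = None, float("inf")
        for feat in features:
            d = haversine_m(rec["lat"], rec["lon"], feat["lat"], feat["lon"])
            if d < best_d:
                best, best_d = feat, d
        out = dict(rec)
        if best is not None and best_d <= max_dist_m:
            out["feature_id"] = best["id"]
            out["dist_m"] = best_d
        else:
            out["feature_id"] = None
            out["dist_m"] = None
        joined.append(out)
    return joined

# Made-up sample rows standing in for the geocoded csv (hypothetical values).
records = [
    {"lat": -3.10, "lon": -60.00, "co2": 2500.0},
    {"lat": -2.50, "lon": -54.90, "co2": 1800.0},
    {"lat": 10.00, "lon": 0.00, "co2": 400.0},  # far from any feature
]
# Made-up river features, each reduced to a single representative point.
features = [
    {"id": "reach_A", "lat": -3.11, "lon": -60.01},
    {"id": "reach_B", "lat": -2.51, "lon": -54.91},
]

joined = nearest_join(records, features)
for row in joined:
    print(row["feature_id"], row["dist_m"])
```

The brute-force loop is quadratic in the number of points, which is acceptable for a few thousand reduced records but would need a spatial index for the full-resolution data.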

Use Cases

High-dimensional, fine-scale temporal and spatial sensor networks are increasingly being deployed for earth system monitoring. However, many research efforts face technical challenges when trying to access and manipulate these large data streams. Our project has three specific use cases:

-- Online, version-controlled hosting of near-real-time sensor data streams.
-- A simple exploratory data analysis (EDA) tool for generating time series plots and scatterplots out of large datasets.
-- Automated predictive testing to examine how other geochemical variables might impact gas concentrations.

Third Party Notes

Earth Engine + associated datasets
USGS dataset
Raster packages: GDAL, rasterstats, geobricks, geopandas

https://pcjericks.github.io/py-gdalogr-cookbook/raster_layers.html

Feb 4 Next Steps:

1) Add data files to the repo in both shapefile and .csv format
2) Test the viability of using Python EE: will it actually serve our needs?
3) Basic visualization of map data in Jupyter
4) Find more examples online of what people have done that we have

WORK LOG NOTES FROM FRIDAY, FEB 19, 2016

In today's meeting, we hashed out several project steps in greater detail. We now see the project in three phases (cleaning, statistical validation, and geographic visualization) plus a showstopper (TBD). Our dataset contains over 25,500 records. Each record has an associated geographic identifier, a time stamp, and a host of sensor-detected biogeochemical and physical attributes (e.g., pH, temperature, CO2). As a csv, the data is around 6.1 MB. We need to reduce the 25,000+ records down to 10,000 so we can upload them into Fusion Tables for use in Earth Engine.

Cleaning (WG-M)

This phase has two parts. First, we need to smooth the time series data using a moving average. This will help adjust for erroneous measurements that might occur due to equipment error or other interruptions. In this step, it would be lovely to be able to tweak the window size in an automated fashion to figure out the best window size for the dataset (smooth but not too smooth). After the data has been smoothed, we then want to reduce the size of the dataset by averaging readings in sets of five and assigning each average to the middle time stamp and latitude/longitude of its set. This will cut our data down significantly, making the dataset more usable. We are trading spatial resolution for ease of use, as the new dataset will have points every couple hundred meters instead of every ten or so meters, but it is worth it to be able to import the data into Earth Engine and Fusion Tables.
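
The two cleaning steps could be sketched roughly as follows. This is a minimal illustration, not the project's code: the row keys (time, lat, lon, co2), the sample values, and the choice to drop an incomplete trailing block are assumptions. Averaging in sets of five would take roughly 25,500 records down to about 5,100, which is under the 10,000 target.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def block_reduce(rows, value_key, block=5):
    """Average value_key over consecutive blocks of rows.

    Each output row takes its time and position from the middle row of
    the block. A trailing block with fewer than `block` rows is dropped
    (an assumption made here for simplicity).
    """
    reduced = []
    for start in range(0, len(rows) - block + 1, block):
        group = rows[start:start + block]
        mid = group[block // 2]
        mean = sum(r[value_key] for r in group) / block
        reduced.append(
            {"time": mid["time"], "lat": mid["lat"], "lon": mid["lon"], value_key: mean}
        )
    return reduced

# Made-up sample rows: one reading every 10 seconds (hypothetical values).
rows = [
    {"time": i * 10, "lat": -3.0 + i * 0.0001, "lon": -60.0, "co2": float(100 + i)}
    for i in range(12)
]

# Step 1: smooth the CO2 series with a 3-point window.
smoothed = moving_average([r["co2"] for r in rows], window=3)
for row, value in zip(rows, smoothed):
    row["co2"] = value

# Step 2: average in sets of five, keeping the middle time stamp and position.
reduced = block_reduce(rows, "co2", block=5)
for row in reduced:
    print(row)
```

Choosing the window automatically is left out; one option would be to run moving_average over a range of window sizes and compare how much of the raw variance each one removes.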

Statistical Validation (RG)

In this phase, we want to be able to test out our data and look for relationships. We want to be able to compare carbon dioxide (CO2) with pH, dissolved oxygen (ODO), temperature, methane (CH4) and turbidity. We also want to be able to compare CH4 with pH, ODO, organic carbon (fDOM), temperature and turbidity. We want to build an automated package to generate scatterplots, with user input, for any of these combinations. The function would also then spit out a table of regression parameters so the user can quickly compare model fits and see how much variability each indicator might explain.
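
A minimal sketch of the regression table, assuming simple one-predictor least-squares fits. The column names and sample numbers are invented for illustration, and the scatterplots themselves are omitted; the (x, y) pairs and fitted lines could be handed to whatever plotting library the project settles on.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, with R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2

def regression_table(data, response, predictors):
    """Fit the response against each predictor; best R^2 first."""
    rows = []
    for name in predictors:
        slope, intercept, r2 = linear_fit(data[name], data[response])
        rows.append(
            {"predictor": name, "slope": slope, "intercept": intercept, "r2": r2}
        )
    return sorted(rows, key=lambda r: r["r2"], reverse=True)

# Made-up sample columns (hypothetical values, not from either river dataset).
data = {
    "co2": [1200.0, 1050.0, 900.0, 750.0, 600.0],
    "ph": [6.0, 6.5, 7.0, 7.5, 8.0],
    "temp": [28.0, 27.0, 29.0, 28.5, 27.5],
}

table = regression_table(data, response="co2", predictors=["ph", "temp"])
for row in table:
    print(row)
```

Sorting by R^2 puts the predictor that explains the most variability at the top, which matches the stated goal of quickly comparing model fits.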

We also want to be able to test our readings for correlation with time and speed to make sure our sensor correction factor actually worked and isn't producing artifacts in our dataset.
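
One way this artifact check might look is a Pearson correlation of a corrected reading against elapsed time and boat speed. The sketch below is illustrative only: the column names, sample values, and the 0.5 flagging threshold are assumptions, not project decisions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def artifact_check(readings, covariates, threshold=0.5):
    """Correlate readings with each covariate and flag strong relationships.

    threshold is an assumed cutoff on |r| chosen for illustration.
    """
    report = {}
    for name, values in covariates.items():
        r = pearson_r(readings, values)
        report[name] = {"r": r, "flag": abs(r) > threshold}
    return report

# Made-up sample values (hypothetical).
co2 = [10.0, 20.0, 30.0, 40.0]
covariates = {
    "elapsed_s": [0.0, 1.0, 2.0, 3.0],  # drifts with the reading
    "speed": [5.0, 7.0, 7.0, 5.0],      # unrelated to the reading
}

report = artifact_check(co2, covariates)
for name, result in report.items():
    print(name, result)
```

A flagged covariate is only a prompt to look closer: a real downstream gradient can also correlate with elapsed time along a transect.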

Time Series Visualization (CF)

We want to build an automated time series plotter that lets the user select the variable and the time step, so we can generate plots for many variables across either river dataset without manual work.
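
The data-preparation half of such a plotter could be sketched as below: the user picks a variable and a time step, and readings are averaged within each step. The row keys (time_s, ch4) and sample values are hypothetical, and the plotting call itself is left out; the returned (time, value) pairs could be passed to a plotting library such as matplotlib.

```python
from collections import defaultdict

def bin_time_series(rows, variable, step_s):
    """Average `variable` within consecutive time bins of width step_s seconds.

    Returns a list of (bin_start_s, mean_value) pairs sorted by time.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        bin_start = int(row["time_s"] // step_s) * step_s
        sums[bin_start] += row[variable]
        counts[bin_start] += 1
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]

# Made-up sample rows: one reading every 10 seconds (hypothetical values).
rows = [
    {"time_s": 0, "ch4": 2.0},
    {"time_s": 10, "ch4": 4.0},
    {"time_s": 20, "ch4": 6.0},
    {"time_s": 30, "ch4": 8.0},
    {"time_s": 40, "ch4": 10.0},
    {"time_s": 50, "ch4": 12.0},
]

series = bin_time_series(rows, variable="ch4", step_s=30)
for bin_start, mean_value in series:
    print(bin_start, mean_value)
```

Because the variable name and step are plain arguments, the same function can be looped over every column of either river dataset.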

Spatial Visualization (CK)

We want to build maps of our data in Earth Engine so we can look at landscape characteristics and compare them to our readings.

Showstopper

Check back for more details later!

February 19th Next Steps

Add csv files to the repo
Share interact plotting widget on the repo
Play with regression function
Build data reduction tool so we can import into Fusion Tables
Brainstorm way around using ArcMap (csv to Fusion Tables)
Create time series plotter