In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
import imp
import glob
import os
import numpy as np
from collections import defaultdict
from sqlalchemy import create_engine
sn.set_context('notebook')
At the Task Force meeting in May 2017, it was decided that the TOC trends analysis should include rolling regressions based on the annually aggregated data (rather than calculating a single set of statistics for a small number of pre-specified time periods). Some code outlining an approach for this in Python can be found here, and John has also created a version using R.
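As a rough illustration of the rolling-regression idea (a hedged sketch only, not the code linked above or John's R version; the 10-year window and 5-value completeness threshold are placeholder choices, not agreed values):

import numpy as np
import pandas as pd
from scipy.stats import theilslopes

def rolling_sen_slope(ser, window=10, min_count=5):
    """Theil-Sen (Sen's) slope in a moving window over an annual
    series indexed by year. Windows with fewer than min_count
    non-null values return NaN."""
    res = {}
    for start in range(len(ser) - window + 1):
        chunk = ser.iloc[start:start + window].dropna()
        if len(chunk) >= min_count:
            slope = theilslopes(chunk.values, chunk.index.values)[0]
        else:
            slope = np.nan
        res[ser.index[start]] = slope  # keyed by window start year
    return pd.Series(res)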
Heleen has asked me to provide John with annual time series for each site for the period from 1990 to 2012 (see e-mail received from Heleen on 12/05/2017 at 10.01, and also the e-mail chain involving John, Don and Heleen between 19th and 20th May 2017).
As a first step, I've modified my previous trends code so that, during the processing, the annual time series are saved as a series of CSVs. The additional lines of code can be found on lines 567 to 574 of toc_trend_analysis.py.
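For reference, the kind of addition involved is sketched below. This is a hypothetical reconstruction, not the actual lines from toc_trend_analysis.py; ann_chem_df stands in for whatever dataframe holds the annual values at that point in the script:

# Hypothetical sketch: write each station's annual series to its
# own CSV, named 'stn_<id>.csv' to match the files read back in below
for stn_id, grp in ann_chem_df.groupby('station_id'):
    csv_path = os.path.join(chem_fold, 'stn_%s.csv' % stn_id)
    grp.to_csv(csv_path, index=False)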
The output CSVs require a small amount of manual cleaning:
Delete the TOC series for station ID 23467 and
Delete the SO4 series for station ID 36561.
(See sections 1.2 and 1.3 of this notebook for justification).
Annual time series for climate based on the updated climate data have already been created and are saved here:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data\cru_climate_summaries.xlsx
The temperature series also need correcting for site elevation, as was done in this notebook, and the two datasets (climate and water chemistry) then need restructuring and joining into a single output file for John to work with.
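To make the lapse-rate correction concrete, here's a worked example with made-up elevations (a station at 450 m in a CRU pixel with mean elevation 820 m; 0.6°C per 100 m is the rate used throughout):

# Hypothetical elevations, for illustration only
stn_elev, px_elev = 450., 820.

# The station sits 370 m below the pixel mean, so its temperature
# should be warmer than the pixel average
t_diff = 0.6 * (px_elev - stn_elev) / 100.  # = +2.22 degrees C

# The pixel temperature series is then shifted by t_diff,
# exactly as in the loop below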
In [2]:
# Data paths
clim_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\CRU_Climate_Data\cru_climate_summaries.xlsx')

stn_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\cru_stn_elevs.csv')

chem_fold = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
             r'\Results\annual_chemistry_series')
In [3]:
# For performance, pre-load the climate output file
# (this saves having to read it within the inner loop below,
# which is very slow)
clim_dict = {}
for var in ['pre', 'tmp']:
    for tm in ['ann', 'jja', 'jas']:
        # Open the climate data
        clim_df = pd.read_excel(clim_xls, sheetname='%s_%s' % (var, tm))
        clim_dict[(var, tm)] = clim_df
In [4]:
# Read stn elev data
stn_df = pd.read_csv(stn_xls)
# Get list of sites
stn_list = stn_df['stn_id'].unique()
# List to store output
data_list = []
# Loop over stations
for stn in stn_list:
    # Read chem data
    chem_path = os.path.join(chem_fold, 'stn_%s.csv' % stn)

    # Only process the 431 files with chem data
    if os.path.exists(chem_path):
        # Allow for manually edited stations (see above)
        # which now have ';' as the delimiter
        if stn in [23467, 36561]:
            chem_df = pd.read_csv(chem_path, sep=';')
        else:
            chem_df = pd.read_csv(chem_path)
        chem_df.index = chem_df['YEAR']

        # Process climate data
        # Dict to store output
        data_dict = {}

        # Loop over data
        for var in ['pre', 'tmp']:
            for tm in ['ann', 'jja', 'jas']:
                # Get the climate data
                clim_df = clim_dict[(var, tm)]

                # Filter the climate data for this station
                stn_clim_df = clim_df.query('stn_id == @stn')

                # Set index
                stn_clim_df.index = stn_clim_df['year']
                stn_clim_df = stn_clim_df.sort_index()

                # Correct temperatures according to lapse rate
                if var == 'tmp':
                    # Get elevations
                    stn_elev = stn_df.query('stn_id == @stn')['elev_m'].values[0]
                    px_elev = stn_df.query('stn_id == @stn')['px_elev_m'].values[0]

                    # If pixel elev is negative (i.e. in sea), correct back to s.l.
                    if px_elev < 0:
                        px_elev = 0

                    # Calculate temperature difference based on 0.6C/100m
                    t_diff = 0.6 * (px_elev - stn_elev) / 100.

                    # Apply correction
                    stn_clim_df['tmp'] = stn_clim_df['tmp'] + t_diff

                # Truncate
                stn_clim_df = stn_clim_df.query('(year>=1990) & (year<=2012)')

                # Add to dict
                key = '%s_%s' % (var, tm)
                val = stn_clim_df[var]
                data_dict[key] = val

        # Build output df
        stn_clim_df = pd.DataFrame(data_dict)

        # Join chem and clim data
        df = pd.merge(stn_clim_df, chem_df, how='outer',
                      left_index=True, right_index=True)

        # Get desired columns
        # Modified 06/06/2017 to include all pars for Leah
        df = df[['pre_ann', 'pre_jas', 'pre_jja', 'tmp_ann', 'tmp_jas',
                 'tmp_jja', 'Al', 'TOC', 'EH', 'ESO4', 'ECl', 'ESO4_ECl',
                 'ENO3', 'ESO4X', 'ECa_EMg', 'ECaX_EMgX', 'ANC']]
        # Transpose
        df = df.T

        # Add station ID
        df.reset_index(inplace=True)
        df['station_id'] = stn

        # Rename cols
        df.columns.name = ''
        cols = list(df.columns)
        cols[0] = 'var'
        df.columns = cols

        data_list.append(df)
# Combine results for each site
ann_df = pd.concat(data_list, axis=0)
ann_df.head()
Out[4]:
The final step is to join in the site metadata used in the previous analysis.
In [5]:
# Read site data from previous output
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\toc_trends_long_format_update1.xlsx')

props_df = pd.read_excel(in_xls, sheetname='toc_trends_long_format_update1',
                         keep_default_na=False)  # Otherwise 'NA' for North America becomes NaN
# Get just cols of interest
props_df = props_df[['project_id', 'project_name', 'station_id', 'station_code',
                     'station_name', 'nfc_code', 'type', 'continent', 'country',
                     'region', 'subregion', 'lat', 'lon']]
# Drop duplicates
props_df.drop_duplicates(inplace=True)
# Join
ann_df = pd.merge(ann_df, props_df, how='left',
                  on='station_id')
# Reorder cols
ann_df = ann_df[['project_id', 'project_name', 'station_id', 'station_code',
                 'station_name', 'nfc_code', 'type', 'continent', 'country',
                 'region', 'subregion', 'lat', 'lon', 'var'] + list(range(1990, 2013))]
ann_df.head()
Out[5]:
Heleen previously defined various criteria for deciding whether or not a series should be included in the analysis. To keep things consistent, it's probably a good idea to include this information here.
In [6]:
# Read site data from previous output
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\toc_trends_long_format_update1.xlsx')
inc_df = pd.read_excel(in_xls, sheetname='toc_trends_long_format_update1')
# Get just cols of interest
inc_df = inc_df[['station_id', 'par_id', 'analysis_period', 'include']]
# Filter to just results for 1990-2012
inc_df = inc_df.query('analysis_period == "1990-2012"')
# Join
ann_df = pd.merge(ann_df, inc_df, how='left',
                  left_on=['station_id', 'var'],
                  right_on=['station_id', 'par_id'])
# Reorder cols
ann_df = ann_df[['project_id', 'project_name', 'station_id', 'station_code',
                 'station_name', 'nfc_code', 'type', 'continent', 'country',
                 'region', 'subregion', 'lat', 'lon', 'var', 'include'] + list(range(1990, 2013))]
# The climate vars all have data and can be included
ann_df['include'].fillna(value='yes', inplace=True)
# Write output
out_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\Results\annual_clim_chem_series_leah.csv')
ann_df.to_csv(out_path, encoding='utf-8')
ann_df.head()
Out[6]:
Finally, it's worth checking that this output matches the time series available on the ICPW website and in the plots of the climate data.
In [7]:
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\annual_clim_chem_series.xlsx')
df = pd.read_excel(in_xls, sheetname='annual_clim_chem_series')
In [8]:
# Select series
stn_id = 23455
var = 'tmp_jja'
df2 = df.query("(station_id==@stn_id) & (var==@var)")
df2 = df2[list(range(1990, 2013))].T
df2.columns = [var]
df2.plot(ls='-', marker='o')
Out[8]:
This all seems OK.