In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')
This notebook continues the work described here.
Heleen has taken a preliminary look at the trends results and has asked for some further processing. See the e-mail (and attached Word document) received 08/11/2016 at 14:25 for details.
It seems that some of the sites in RESA2 have been mistakenly assigned to Slovenia rather than Slovakia. I've edited the entries for 12 stations in RESA2.STATIONS_PAR_VALUES to correct this.
Heleen has identified problems with the TOC record in RESA2 for site 23467 (Lake 239, Ontario): some of the values appear to be out by a factor of 100. The first step is to see whether the issue lies with the raw data supplied by the focal centre, or whether something went wrong during the upload process.
A quick manual check in RESA2 shows that the problem records all come from the period between spring 2000 and spring 2004. Unfortunately, I can't find the raw data for this site on the NIVA network, so my options for cleaning the data are limited. There are 17 values in the data series where $TOC \ge 500 \; mg/l$, which must surely be wrong. In fact, looking at the underlying Oracle database, there are many more TOC values greater than 500 mg/l. These don't appear in RESA2, though, so I'm not sure what's going on. I don't have time to dig into the wider database issues now, so in the code below I've simply removed the TOC data for this station from the analysis.
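For reference, the kind of screening used to spot these values is sketched below. It assumes the raw chemistry has already been pulled into a dataframe raw_df with columns station_id, par_id and value; these names are assumptions for illustration, not the actual RESA2 schema.

# Hypothetical sketch: count implausible TOC concentrations (>= 500 mg/l)
# at station 23467. 'raw_df' and its column names are assumed, not real
mask = ((raw_df['station_id'] == 23467) &
        (raw_df['par_id'] == 'TOC') &
        (raw_df['value'] >= 500))
print(mask.sum(), 'suspect TOC values')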
NB: This is not a long-term solution - the problems with RESA2 and the data in the database need correcting. What I've done below is just some post-processing of the trends results as a temporary fix.
Apparently the $SO_4$ record at site 36561 is affected by a rapidly melting glacier, so the sulphate series (and presumably anything involving sulphate?) should be removed from the analysis. As with the TOC record mentioned above, I've simply filtered out the problem data in a post-processing step, rather than correcting the problem in the database.
The code below reads the "long format" output from the previous notebook (toc_trends_long_format.csv) and performs some additional data cleaning.
In [2]:
# Read data
in_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\toc_trends_long_format.csv')
df = pd.read_csv(in_csv, encoding='utf-8')
# Remove TOC at site 23467
df = df.query('not((station_id==23467) and (par_id=="TOC"))')
# Remove sulphate-related series at site 36561
df = df.query('not((station_id==36561) and ((par_id=="ESO4") or '
              '(par_id=="ESO4X") or '
              '(par_id=="ESO4_ECl")))')
In [3]:
# Read Heleen's modified spreadsheet
in_xls = (r'K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI'
          r'\Database\2015 DOC analysis\2016\2016-11\trends_sites_oct_2016.xlsx')
stn_df = pd.read_excel(in_xls, sheet_name='data',
                       keep_default_na=False)  # otherwise 'NA' for North America becomes NaN
# Remove unwanted cols
stn_df.drop(labels=['station_code', 'station_name',
                    'nfc_code', 'type', 'lat', 'lon'],
            axis=1, inplace=True)
# Remove 'country' from original output (it includes
# 'Slovenia', whereas 'country' in stn_df is corrected to
# 'Slovakia')
del df['country']
# Join
df = pd.merge(df, stn_df, how='left', on='station_id')
In [4]:
# Relative slope
df['rel_sen_slp'] = df['sen_slp'] / df['median']
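The relative slope is simply the Sen slope normalised by the series median, so multiplying by 100 gives an approximate percent change per year. A purely illustrative way to inspect the resulting distribution:

# Illustrative only: relative Sen slope expressed as % change per year
# relative to the series median
(df['rel_sen_slp'] * 100).describe()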
In [5]:
# Remove unwanted cols
df.drop(labels=['mean', 'n_end', 'n_start', 'mk_stat', 'norm_mk_stat'],
        axis=1, inplace=True)
# Reorder columns
df = df[['project_id', 'project_name', 'station_id', 'station_code',
         'station_name', 'nfc_code', 'type', 'continent', 'country',
         'region', 'subregion', 'lat', 'lon', 'analysis_period',
         'data_period', 'par_id', 'non_missing', 'median', 'std_dev',
         'mk_p_val', 'trend', 'sen_slp', 'rel_sen_slp', 'include']]
# Write to output
res_fold = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Results'
out_path = os.path.join(res_fold, 'toc_trends_long_format_update1.csv')
df.to_csv(out_path, index=False, encoding='utf-8')
df.head()
Out[5]:
I can now repeat the code from the previous notebook (with a few modifications) to convert this dataframe from "long" to "wide" format.
In [6]:
del df['data_period']
# Melt to "long" format
melt_df = pd.melt(df,
                  id_vars=['project_id', 'project_name', 'station_id', 'station_code',
                           'station_name', 'nfc_code', 'type', 'continent', 'country',
                           'region', 'subregion', 'lat', 'lon', 'analysis_period',
                           'par_id', 'include'],
                  var_name='stat')
# Get only values where include='yes'
melt_df = melt_df.query('include == "yes"')
del melt_df['include']
# Build multi-index on everything except "value"
melt_df.set_index(['project_id', 'project_name', 'station_id', 'station_code',
                   'station_name', 'nfc_code', 'type', 'continent', 'country',
                   'region', 'subregion', 'lat', 'lon', 'par_id', 'analysis_period',
                   'stat'], inplace=True)
# Unstack levels of interest to columns
wide_df = melt_df.unstack(level=['par_id', 'analysis_period', 'stat'])
# Drop unwanted "value" level in index
wide_df.columns = wide_df.columns.droplevel(0)
# Replace multi-index with separate components concatenated with '_'
wide_df.columns = ["_".join(item) for item in wide_df.columns]
# Reset multiindex on rows
wide_df = wide_df.reset_index()
# Save output
out_path = os.path.join(res_fold, 'toc_trends_wide_format_update1.csv')
wide_df.to_csv(out_path, index=False, encoding='utf-8')
wide_df.head()
Out[6]:
Don has supplied climate data for (most of?) the sites of interest. The most recent climate files are here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\climate data\2016-02-02 from don
(one file for precipitation and one for temperature).
Each file includes a range of climate statistics for each site. Linking these datasets to the TOC trends output is probably best done by restructuring the climate data to "wide" format (i.e. one row per site) and then joining it to the "wide" trends results.
Update 09/12/2016: Heleen has pointed out that some of the sites seem to be missing climate data. The problems concern all 8 Czech sites, plus Newbert Pond and Big Hope Pond in the USA. As far as I can tell, climate data has not been provided for the two US sites. The Czech sites are missing climate information because I've re-entered them with new site codes (see section 3 of this notebook for details), so my code (below) was originally failing to match Don's climate output (using the old codes) to my statistical summaries (using the new codes). I've therefore created two copies of the climate data, Precip_res_NEW_corr.csv and Temp_res_NEW_corr.csv, where I've changed the codes for the Czech sites to match the current codes used in RESA2. Note, however, that climate data is not available for Uhlirska (CZ08), but is present for the other 7 Czech sites.
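For reference, the correction in the *_corr.csv files amounts to a straightforward substitution of station codes. A minimal sketch of the idea, assuming one of Don's climate files has been read into don_df; the dictionary below uses placeholder codes, not the real old/new ID pairs (those are documented in section 3 of the notebook linked above):

# Hypothetical sketch: remap Don's old Czech station codes to the new
# RESA2 codes. The codes in this dict are placeholders only
czech_map = {'OLD_CZ01': 'NEW_CZ01',
             'OLD_CZ02': 'NEW_CZ02'}
don_df['StationID'] = don_df['StationID'].replace(czech_map)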
Update 23/02/2017: I have now updated and reprocessed the climate data - see the notebooks here and here. The code in the cell immediately below is therefore unnecessary, as I am no longer using the climate data from Don. The code cell after it joins my new climate data to wide_df.
In [7]:
## Process raw climate data
## File paths
#pptn_csv = (r'K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI'
# r'\Database\2015 DOC analysis\climate data\2016-02-02 from don\Precip_res_NEW_corr.csv')
#
#temp_csv = (r'K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI'
# r'\Database\2015 DOC analysis\climate data\2016-02-02 from don\Temp_res_NEW_corr.csv')
#
## Container for DFs
#df_list = []
#
## Loop over files
#for csv in [pptn_csv, temp_csv]:
# # Read data
# df = pd.read_csv(csv, delimiter=';')
#
# # Melt
# df = pd.melt(df, id_vars=['StationID', 'Variable'],
# var_name='Param', value_name='value')
#
# # Concat 'variable' and 'param' cols
# # Convert to lower case
# df['variable'] = df['Variable'].str.lower() + '_' + df['Param'].str.lower()
#
# # Tidy
# df['station_id'] = df['StationID']
# del df['Param'], df['Variable'], df['StationID']
#
# # Pivot
# df = df.pivot(index='station_id', columns='variable',
# values='value')
#
# # Add to list
# df_list.append(df)
#
## Concat pptn and temp data
#df = pd.concat(df_list, axis=1)
#
## Reset index and tidy
#df.reset_index(inplace=True)
#df.columns.name = None
#
#df.head()
In [7]:
# Read updated climate data
in_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\CRU_Climate_Data\icpw_climate_stats.csv')
df = pd.read_csv(in_csv)
# Rename the 'stn_id' column to match the trends data
df.rename(columns={'stn_id': 'station_id'}, inplace=True)
df.head()
Out[7]:
This can now be joined to the "wide" format trends data.
In [8]:
# Join climate to trends
clim_df = pd.merge(wide_df, df, how='left', on='station_id')
# Save output
out_path = os.path.join(res_fold, 'toc_trends_wide_format_up1_climate.csv')
clim_df.to_csv(out_path, index=False, encoding='utf-8')
clim_df.head()
Out[8]: