In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')

TOC trends October 2016 (part 3)

This notebook continues the work described here.

Heleen has taken a preliminary look at the trends results and has asked for some further processing. See the e-mail (and attached Word document) received 08/11/2016 at 14:25 for details.

1. Fix database errors

1.1. Change Slovenia to Slovakia

It seems that some of the sites in RESA2 have been mistakenly assigned to Slovenia, rather than Slovakia. I've edited the entires for 12 stations in RESA2.STATIONS_PAR_VALUES to correct this.

1.2. TOC for station ID 23467

Heleen has identified problems with the TOC record in RESA for site 23467 (Lake 239, Ontario) - it looks as though some of the values are out by a factor of 100. These values are wrong in RESA, so the first step is to see whether it's an issue with the raw data supplied from the focal centre, or whether something went wrong with the upload process.

A quick manual check in RESA2 shows that the problem records are all from the period between spring 2000 and spring 2004. Unfortunately, I can't find the raw data for this site on the NIVA network, so my options for cleaning the data are limited. There are 17 values in the data series where $TOC \ge 500 \; mg/l$, which must surely be wrong? In fact, looking the the underlying Oracle database, there are many more TOC values greater than 500 mg/l. These don't appear in RESA2, though, so I'm not sure what's going on. I don't have time to dig into the wider database issues now, so in the code below I've simply removed the TOC data for this station from the analysis.

NB: This is not a long-term solution - the problems with RESA2 and the data in the database need correcting. What I've done below is just some post-processing of the trends results as a temporary fix.

1.3. SO4 for station ID 36561

Apparently the $SO_4$ record at this site is affected by a rapidly melting glacier, so the sulphate series (and presumably anything involving sulphate?) should be removed from the analysis. As with the TOC record mentioned above, I've simply filtered-out the problem data in a post-processing step, rather than correcting the problem in the database.

The code below reads the "long format" output from the previous notebook (toc_trends_long_format.csv) and performs some additional data cleaning.


In [2]:
# Read data
in_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\toc_trends_long_format.csv')
df = pd.read_csv(in_csv, encoding='utf-8')

# Remove TOC at site 23467
df = df.query('not((station_id==23467) and (par_id=="TOC"))')

# Remove sulphate-related series at site 36561
df = df.query('not((station_id==36561) and ((par_id=="ESO4") or '
                                           '(par_id=="ESO4X") or '
                                           '(par_id=="ESO4_ECl")))')

Heleen has added some columns to my Excel file of basic site properties:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\2016\2016-11\trends_sites_oct_2016.xlsx

These need joining in.


In [3]:
# Read Heleen's modified spreadsheet
in_xls = (r'K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI'
          r'\Database\2015 DOC analysis\2016\2016-11\trends_sites_oct_2016.xlsx')
stn_df = pd.read_excel(in_xls, sheetname='data', keep_default_na=False) # Otherwise 'NA' for North America becomes NaN

# Remove unwanted cols
stn_df.drop(labels=['station_code', 'station_name',
                    'nfc_code', 'type', 'lat', 'lon'],
            axis=1, inplace=True)

# Remove 'country' from original output (it includes
# 'Slovenia', whereas 'country' in stn_df is corrected to
# 'Slovakia')
del df['country']

# Join
df = pd.merge(df, stn_df, how='left', on='station_id')

3. Relative slope

Heleen would like an additional column called rel_sen_slp, calculated as sen_slp / median.


In [4]:
# Relative slope
df['rel_sen_slp'] = df['sen_slp'] / df['median']

4. Delete unwanted columns

We do not need the following columns: ‘mean’, ‘n_end’, ‘n_start’, ‘mk_stat’ and ‘norm_mk_stat’.


In [5]:
# Remove unwanted cols
df.drop(labels=['mean', 'n_end', 'n_start', 'mk_stat', 'norm_mk_stat'],
        axis=1, inplace=True)

# Reorder columns
df = df[['project_id', 'project_name', 'station_id', 'station_code', 
         'station_name', 'nfc_code', 'type', 'continent', 'country', 
         'region', 'subregion', 'lat', 'lon', 'analysis_period',
         'data_period', 'par_id', 'non_missing', 'median', 'std_dev',
         'mk_p_val', 'trend', 'sen_slp', 'rel_sen_slp', 'include']]

# Write to output
res_fold = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Results'
out_path = os.path.join(res_fold, 'toc_trends_long_format_update1.csv')
df.to_csv(out_path, index=False, encoding='utf-8')

df.head()


Out[5]:
project_id project_name station_id station_code station_name nfc_code type continent country region ... data_period par_id non_missing median std_dev mk_p_val trend sen_slp rel_sen_slp include
0 4012 ICPW_TOCTRENDS_2015_NO 100 623-603 Breidlivatnet NaN L EUR Norway SoNord ... 1990-2012 Al 1 276.433655 NaN NaN NaN NaN NaN no
1 4012 ICPW_TOCTRENDS_2015_NO 100 623-603 Breidlivatnet NaN L EUR Norway SoNord ... 1990-2012 TOC 21 6.000000 1.184066 8.803429e-04 increasing 0.116569 0.019428 yes
2 4012 ICPW_TOCTRENDS_2015_NO 100 623-603 Breidlivatnet NaN L EUR Norway SoNord ... 1990-2012 EH 21 9.772372 3.596191 3.595511e-04 decreasing -0.380255 -0.038911 yes
3 4012 ICPW_TOCTRENDS_2015_NO 100 623-603 Breidlivatnet NaN L EUR Norway SoNord ... 1990-2012 ESO4 21 29.166667 15.103219 1.982305e-07 decreasing -2.051373 -0.070333 yes
4 4012 ICPW_TOCTRENDS_2015_NO 100 623-603 Breidlivatnet NaN L EUR Norway SoNord ... 1990-2012 ECl 21 17.142857 3.824173 1.272998e-03 decreasing -0.351074 -0.020479 yes

5 rows × 24 columns

I can now repeat the code from the previous notebook (with a few modifications) to convert this dataframe from "long" to "wide" format.


In [6]:
del df['data_period']

# Melt to "long" format
melt_df = pd.melt(df, 
                  id_vars=['project_id', 'project_name', 'station_id', 'station_code', 
                           'station_name', 'nfc_code', 'type', 'continent', 'country', 
                           'region', 'subregion', 'lat', 'lon', 'analysis_period',
                           'par_id', 'include'],
                  var_name='stat')

# Get only values where include='yes'
melt_df = melt_df.query('include == "yes"')
del melt_df['include']

# Build multi-index on everything except "value"
melt_df.set_index(['project_id', 'project_name', 'station_id', 'station_code', 
                   'station_name', 'nfc_code', 'type', 'continent', 'country', 
                   'region', 'subregion', 'lat', 'lon', 'par_id', 'analysis_period', 
                   'stat'], inplace=True)

# Unstack levels of interest to columns
wide_df = melt_df.unstack(level=['par_id', 'analysis_period', 'stat'])

# Drop unwanted "value" level in index
wide_df.columns = wide_df.columns.droplevel(0)

# Replace multi-index with separate components concatenated with '_'
wide_df.columns = ["_".join(item) for item in wide_df.columns]

# Reset multiindex on rows
wide_df = wide_df.reset_index()

# Save output
out_path = os.path.join(res_fold, 'toc_trends_wide_format_update1.csv')
wide_df.to_csv(out_path, index=False, encoding='utf-8')

wide_df.head()


Out[6]:
project_id project_name station_id station_code station_name nfc_code type continent country region ... EH_1998-2012_rel_sen_slp ESO4_1998-2012_rel_sen_slp ECl_1998-2012_rel_sen_slp ENO3_1998-2012_rel_sen_slp ESO4X_1998-2012_rel_sen_slp ESO4_ECl_1998-2012_rel_sen_slp ECa_EMg_1998-2012_rel_sen_slp ECaX_EMgX_1998-2012_rel_sen_slp ANC_1998-2012_rel_sen_slp Al_1998-2012_rel_sen_slp
0 3810 ICPW_TOCTRENDS_2015_FI 23542 FI01 Hirvilampi NaN L EUR Finland SoNord ... 0 -0.0387755 -0.02 -0.00606061 -0.0392665 -0.033388 -0.0176707 -0.0172829 0.0395843 0
1 3810 ICPW_TOCTRENDS_2015_FI 23545 FI05 Suopalampi NaN L EUR Finland NoNord ... -0.0045469 -0.0277778 -0.0113636 -0.111111 -0.0261829 -0.0290927 0 0 0.0110313 None
2 3810 ICPW_TOCTRENDS_2015_FI 23546 FI06 Vasikkajärvi NaN L EUR Finland NoNord ... 0.0226897 -0.032967 -0.00909091 -0.142857 -0.0336536 -0.0270479 -0.0214286 -0.0219901 0.0248658 None
3 3810 ICPW_TOCTRENDS_2015_FI 23547 FI07 Vitsjön NaN L EUR Finland SoNord ... 0.0135936 -0.0428994 -0.00165837 0.0257937 -0.046445 -0.0238486 -0.0237154 -0.0263162 0.00456399 None
4 3810 ICPW_TOCTRENDS_2015_FI 23548 FI08 Kakkisenlampi NaN L EUR Finland SoNord ... -0.00582281 -0.0321429 0 -0.115646 -0.0328831 -0.0258197 -0.025974 -0.029293 0.257827 None

5 rows × 244 columns

5. Climate data

Don has supplied climate data for (most of?) the sites of interest. The most recent climate files are here:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\climate data\2016-02-02 from don

(one file for precipitation and one for temperature).

Each file includes a range of climate statistics for each site. Linking these datasets to the TOC trends output is probably best done by restructuring the climate data to "wide" format (i.e. one row per site) and then joining it to the "wide" trends results.

Update 09/12/2016: Heleen has pointed out that some of the sites seem to be missing climate data. The problems concern all 8 Czech sites, plus Newbert Pond and Big Hope Pond in the USA. As far as I can tell, climate data has not been provided for the two US sites. The Czech sites are missing climate information because I've re-entered them with new site codes (see section 3 of this notebook for details), so my code (below) was originally failing to match Don's climate output (using the old codes) to my statistical summaries (using the new codes). I've therefore created two copies of the climate data, Precip_res_NEW_corr.csv and Temp_res_NEW_corr.csv, where I've changed the codes for the Czech sites to match the current codes used in RESA2. Note, however, that climate data is not available for Uhlirska (CZ08), but is present for the other 7 Czech sites.

Update 23/02/2017: I have now updated and reprocessed the climate data - see the notebooks here and here. The code in the cell immediately below is therefore unnecessary, as I am no longer using the climate data from Don. The code cell below this joins my new climate data to wide_df.


In [7]:
## Process raw climate data
## File paths
#pptn_csv = (r'K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI'
#            r'\Database\2015 DOC analysis\climate data\2016-02-02 from don\Precip_res_NEW_corr.csv')
#
#temp_csv = (r'K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI'
#            r'\Database\2015 DOC analysis\climate data\2016-02-02 from don\Temp_res_NEW_corr.csv')
#
## Container for DFs
#df_list = []
#
## Loop over files
#for csv in [pptn_csv, temp_csv]:
#    # Read data
#    df = pd.read_csv(csv, delimiter=';')
#
#    # Melt
#    df = pd.melt(df, id_vars=['StationID', 'Variable'],
#                 var_name='Param', value_name='value')
#    
#    # Concat 'variable' and 'param' cols
#    # Convert to lower case
#    df['variable'] = df['Variable'].str.lower() + '_' + df['Param'].str.lower()
#    
#    # Tidy
#    df['station_id'] = df['StationID']
#    del df['Param'], df['Variable'], df['StationID']
#    
#    # Pivot
#    df = df.pivot(index='station_id', columns='variable',
#                  values='value')
#    
#    # Add to list
#    df_list.append(df)
#
## Concat pptn and temp data
#df = pd.concat(df_list, axis=1)
#
## Reset index and tidy
#df.reset_index(inplace=True)
#df.columns.name = None
#
#df.head()


Out[7]:
station_id precip_jas_lm_pval_90_04 precip_jas_lm_pval_90_12 precip_jas_lm_pval_98_12 precip_jas_lm_slope_90_04 precip_jas_lm_slope_90_12 precip_jas_lm_slope_98_12 precip_jas_manken_pval_90_04 precip_jas_manken_pval_90_12 precip_jas_manken_pval_98_12 ... temp_yr_lm_pval_98_12 temp_yr_lm_slope_90_04 temp_yr_lm_slope_90_12 temp_yr_lm_slope_98_12 temp_yr_manken_pval_90_04 temp_yr_manken_pval_90_12 temp_yr_manken_pval_98_12 temp_yr_senslope90_04 temp_yr_senslope90_12 temp_yr_senslope98_12
0 100 0.325835 0.000240 0.003147 2.503929 7.122036 10.446429 0.766525 0.006020 0.113287 ... 0.731979 0.030417 0.017762 -0.016012 0.276278 0.204904 0.843085 0.046970 0.038889 0.033333
1 101 0.685728 0.001986 0.009279 0.941429 5.029051 8.010000 0.692181 0.008265 0.029448 ... 0.579869 0.028899 0.012467 -0.026905 0.373053 0.267327 0.843085 0.034722 0.029167 0.018333
2 102 0.647042 0.001862 0.013072 1.148214 5.088933 8.318214 0.346495 0.003980 0.022822 ... 0.433407 0.021250 0.001952 -0.040744 0.620691 0.443575 1.000000 0.026042 0.025000 -0.006061
3 103 0.394579 0.000502 0.002914 2.205357 6.711759 10.253214 0.620691 0.015109 0.137646 ... 0.834824 0.007351 0.009247 -0.009107 1.000000 0.398036 0.620691 0.013333 0.025000 0.027778
4 104 0.305090 0.000457 0.004516 2.369643 6.359486 9.833929 0.620691 0.005118 0.037667 ... 0.667430 0.024970 0.013019 -0.020565 0.346495 0.234485 0.843085 0.027083 0.030556 0.025000

5 rows × 97 columns


In [7]:
# Read updated climate data
in_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\CRU_Climate_Data\icpw_climate_stats.csv')

df = pd.read_csv(in_csv)

# Rename stn_id col
df['station_id'] = df['stn_id']
del df['stn_id']

df.head()


Out[7]:
pre_ann_1990-2004_sig pre_ann_1990-2004_slp pre_ann_1990-2012_avg pre_ann_1990-2012_sig pre_ann_1990-2012_slp pre_ann_1998-2012_sig pre_ann_1998-2012_slp pre_jas_1990-2004_sig pre_jas_1990-2004_slp pre_jas_1990-2012_avg ... tmp_jas_1998-2012_sig tmp_jas_1998-2012_slp tmp_jja_1990-2004_sig tmp_jja_1990-2004_slp tmp_jja_1990-2012_avg tmp_jja_1990-2012_sig tmp_jja_1990-2012_slp tmp_jja_1998-2012_sig tmp_jja_1998-2012_slp station_id
0 0.692181 -1.400000 1417.886957 0.107049 8.312500 0.322300 12.266667 0.552615 -1.937500 298.917391 ... 0.091661 0.066667 0.691463 -0.022222 18.673623 0.177143 0.027778 0.081880 0.056667 37284
1 0.373053 -8.233333 1470.369565 0.832666 -2.420000 0.692181 -3.620000 1.000000 -0.114286 277.091304 ... 0.215458 0.044444 0.655647 -0.013333 17.394667 0.060503 0.033333 0.112404 0.036364 37320
2 0.322300 -6.025000 1432.421739 0.672611 1.916667 0.843085 4.700000 0.843085 -0.525000 293.239130 ... 0.254453 0.057143 0.519496 -0.009524 17.381797 0.096025 0.036667 0.198211 0.050000 37300
3 0.322300 -6.025000 1432.421739 0.672611 1.916667 0.843085 4.700000 0.843085 -0.525000 293.239130 ... 0.254453 0.057143 0.519496 -0.009524 17.201797 0.096025 0.036667 0.198211 0.050000 37311
4 0.322300 -6.025000 1432.421739 0.672611 1.916667 0.843085 4.700000 0.843085 -0.525000 293.239130 ... 0.254453 0.057143 0.519496 -0.009524 17.411797 0.096025 0.036667 0.198211 0.050000 37301

5 rows × 43 columns

This can now be joined to the "wide" format trends data.


In [8]:
# Join climate to trends
clim_df = pd.merge(wide_df, df, how='left', on='station_id')

# Save output
out_path = os.path.join(res_fold, 'toc_trends_wide_format_up1_climate.csv')
clim_df.to_csv(out_path, index=False, encoding='utf-8')

clim_df.head()


Out[8]:
project_id project_name station_id station_code station_name nfc_code type continent country region ... tmp_jas_1990-2012_slp tmp_jas_1998-2012_sig tmp_jas_1998-2012_slp tmp_jja_1990-2004_sig tmp_jja_1990-2004_slp tmp_jja_1990-2012_avg tmp_jja_1990-2012_sig tmp_jja_1990-2012_slp tmp_jja_1998-2012_sig tmp_jja_1998-2012_slp
0 3810 ICPW_TOCTRENDS_2015_FI 23542 FI01 Hirvilampi NaN L EUR Finland SoNord ... 0.075000 0.619832 0.051852 0.346495 0.066667 16.542406 0.266993 0.042424 0.655647 0.016667
1 3810 ICPW_TOCTRENDS_2015_FI 23545 FI05 Suopalampi NaN L EUR Finland NoNord ... 0.091667 0.399612 0.050000 0.007533 0.116667 13.525942 0.072509 0.050000 0.843085 -0.008333
2 3810 ICPW_TOCTRENDS_2015_FI 23546 FI06 Vasikkajärvi NaN L EUR Finland NoNord ... 0.091667 0.399612 0.050000 0.007533 0.116667 13.429942 0.072509 0.050000 0.843085 -0.008333
3 3810 ICPW_TOCTRENDS_2015_FI 23547 FI07 Vitsjön NaN L EUR Finland SoNord ... 0.062500 0.655647 0.033333 0.276278 0.091667 15.834435 0.427425 0.046667 0.881835 0.018182
4 3810 ICPW_TOCTRENDS_2015_FI 23548 FI08 Kakkisenlampi NaN L EUR Finland SoNord ... 0.107407 0.234955 0.083333 0.037667 0.103333 15.269768 0.036874 0.082456 0.921159 0.022222

5 rows × 286 columns