In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import nivapy3 as nivapy
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
import os
import pickle
import itertools
import toc_trends_analysis as toc_trends
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
from IPython.display import Image

#warnings.filterwarnings("ignore")
plt.style.use('ggplot')
pandas2ri.activate()
bcp = importr('bcp')

In [ ]:
# Connect to db
eng = nivapy.da.connect()

TOC Thematic Report - February 2019 (Part 6a: analysis of annual trends; strict selection criteria)

Added 24.06.2019: This notebook implements the original workflow (prior to the 2019 Task Force meeting) using strict selection criteria for which sites to include. Following discussion at the TF meeting, I have also created a new notebook ("Part 6b"), which applies the same workflow with less strict criteria (i.e. a broader range of sites).

Øyvind's outline for the Thematic Trends Report includes the following (see e-mail received 28/05/2019 at 12:06 for details):

My wish list for the presentation is as follows (most important first).

  1. MK and Sen-slope statistics (extended selection, 1990-2016)
  2. A comparison of trends in annual means and annual minima for pH, ANC, alkalinity. Are they exactly the same? Here we need to use certain minimum criteria for sampling frequency (maybe minimum 4 per year for lakes and 12 per year for streams).
  3. A plot and table suitable for presenting the results of point 1
  4. Plots/tables that are suitable for presenting the results (1-3).
  5. More BCP tests for breakpoints, applying less strict criteria for sampling frequency.

In terms of the work done so far, notebook 4 produced "spaghetti"/scatter plots to provide an overall impression of the data, while notebook 5 focused on sites with high frequency data (~ weekly or better) and tested two different statistical algorithms for identifying trends and breakpoints. This notebook deals more explicitly with the points from Øyvind above. The workflow is broadly as follows:

  1. Identify sites with suitable sampling records within the period from 1990 to 2016

  2. Calculate annual medians (better than means, as requested by Øyvind?) for TOC, EH, ESO4X, ENO3, ECaX_EMgX and ANC at each site. Also calculate annual minima for the same parameters (to represent the severity of episodes)

  3. Generate plots for each site showing the Sen's slope for each of the series listed above. Also apply the M-K test (which usually agrees with Sen's slope, but is worth having nonetheless)

  4. Apply the bcp algorithm from notebook 5 to the annual series for each site, attempting to identify breakpoints

  5. Regionalise/aggregate the results from 3 and 4 above so they can be presented concisely. When I get a chance I'll also run regional M-K and/or Sen's slope tests, but for now the simplest (and perhaps clearest?) approach will be to explore e.g. the distribution of (significant) Sen's slopes per region. I will also plot the distribution of change point years per region to see if clear patterns emerge

Further refinements will be required, but this seems like a reasonable starting point given Øyvind's wish-list and current time constraints.

1. Select sites

The first step is to identify sites with sufficient sampling. Following discussion with Øyvind (see e-mails sent/received 28/05/2019), I will use the following criteria as a starting point:

  • For lakes. Aggregate to seasonal frequency and require that fewer than 25% of the seasonal values within the period from 1995 to 2011 are missing
  • For rivers. Aggregate to monthly frequency and require that fewer than 25% of the monthly values within the period from 1995 to 2011 are missing

The original idea was that lakes should have at least one sample per season in order for annual means/medians to be robust, while rivers need at least one sample per month. We are also primarily interested in long time series - preferably spanning the whole period from 1990 to 2016. However, these criteria are rather restrictive: only a few sites have genuinely continuous records from 1990 to 2016, at either seasonal or monthly frequency. We have therefore relaxed the requirements a little, by allowing some missing data before 1995 or after 2011, and requiring the series between these years to be at least 75% complete.

Note: The revised criteria are still quite restrictive - time series from some regions (e.g. Atlantic Canada and the Appalachians) are removed completely due to data gaps. We should perhaps make further adjustments here to get the right balance between robustness and inclusivity.
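
A minimal sketch of this completeness rule for a single series is given below. It is illustrative only - the full loop in the following cells is what is actually used - and the helper name passes_completeness and its arguments are hypothetical. ser is assumed to be a pandas Series of raw sample values indexed by sample date.

In [ ]:
# Illustrative sketch of the selection rule for one series (not used directly below)
def passes_completeness(ser, freq, start='1995-01-01', end='2011-12-31', thresh=25):
    """Return True if no more than `thresh`% of the aggregated values between
    `start` and `end` are missing (and the series is not entirely empty).
    Use freq='M' for rivers and freq='Q-FEB' for lakes."""
    # Aggregate to the target frequency and reindex to the full 1990-2016 period
    agg = ser.resample(freq).median()
    agg = agg.reindex(pd.date_range('1990-01-01', '2016-12-31', freq=freq))

    if agg.isna().all():
        return False

    # Require the 1995-2011 part of the series to be at least 75% complete
    core = agg.truncate(before=start, after=end)
    pct_null = 100 * core.isna().sum() / len(core)

    return pct_null <= thresh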


In [ ]:
# Read stations
stn_path = r'../../../all_icpw_sites_may_2019.xlsx'
stn_df = pd.read_excel(stn_path, sheet_name='all_icpw_stns')
stn_df.head()

In [ ]:
nivapy.spatial.quickmap(stn_df,
                        cluster=True,
                        popup='station_code')

We will apply different selection criteria for rivers versus lakes, so the next step is to extract this information from RESA.


In [ ]:
# Get lake or river from RESA
sql = ("SELECT station_id, lake_or_river "
       "FROM resa2.stations "
       "WHERE station_id IN %s" % str(tuple(stn_df['station_id'].values)))
lr_df = pd.read_sql(sql, eng)
stn_df = pd.merge(stn_df, lr_df, how='left', on='station_id')
stn_df.head()

In [ ]:
# Load saved chem data
wc_csv = r'../../../Thematic_Trends_Report_2019/working_chem.csv'
wc_df = pd.read_csv(wc_csv, encoding='utf-8')
wc_df['sample_date'] = pd.to_datetime(wc_df['sample_date'])
wc_df.head()

In [ ]:
# Melt to long format
df = wc_df.copy()
del df['station_code'], df['station_name'], df['depth1'], df['depth2']
df = pd.melt(df, id_vars=['station_id', 'sample_date'])
df.head()

The code below creates a dataframe showing which series at which sites meet the criteria defined above.


In [ ]:
# Threshold for percent null values in the interval 1995 to 2011
pct_null_thresh = 25

# Dict for results
inc_dict = {'station_id':[],
            'variable':[],
            'include':[],
           }

# Loop over time series
for stn_id in df['station_id'].unique():
    # Get river or lake
    wb_type = stn_df.query('station_id == @stn_id')['lake_or_river'].values[0]
        
    # Loop over variables
    for par in df['variable'].unique():    
        # Get data
        df2 = df.query("(station_id == @stn_id) and (variable == @par)")
        df2.set_index('sample_date', inplace=True)
        del df2['station_id'], df2['variable']
        
        # Process based on wb type
        if wb_type == 'R':
            # Monthly dates from 1990 - 2016
            all_dates = pd.date_range('1990-01-01', '2016-12-31', freq='M')
            dates_df = pd.DataFrame(index=all_dates)

            # Need at least 1 sample per month
            df2 = df2.resample('M').median()
            df2 = dates_df.join(df2)
            
            if pd.isna(df2['value']).all():
                # Not suitable
                inc_dict['station_id'].append(stn_id)
                inc_dict['variable'].append(par)
                inc_dict['include'].append('no') 
                
            else:
                # Need complete record between 1995 and 2011
                df3 = df2.truncate(before='1995-01-01', after='2011-12-31')
                pct_null = 100*pd.isna(df3['value']).sum()/len(df3)
                #if (pd.isna(df3['value']).sum() > 0):
                if pct_null > pct_null_thresh:
                    # Not suitable
                    inc_dict['station_id'].append(stn_id)
                    inc_dict['variable'].append(par)
                    inc_dict['include'].append('no') 

                else:
                    # Include
                    inc_dict['station_id'].append(stn_id)
                    inc_dict['variable'].append(par)
                    inc_dict['include'].append('yes')   
                    
        if wb_type == 'L':
            # Seasonal (Q-FEB quarterly) dates from 1990 - 2016
            all_dates = pd.date_range('1990-01-01', '2016-12-31', freq='Q-FEB')
            dates_df = pd.DataFrame(index=all_dates)
            
            # Need at least 1 sample per season
            df2 = df2.resample('Q-FEB').median()
            df2 = dates_df.join(df2)
            
            if pd.isna(df2['value']).all():
                # Not suitable
                inc_dict['station_id'].append(stn_id)
                inc_dict['variable'].append(par)
                inc_dict['include'].append('no') 
            
            else:
                # Need complete record between 1995 and 2011
                df3 = df2.truncate(before='1995-01-01', after='2011-12-31')
                pct_null = 100*pd.isna(df3['value']).sum()/len(df3)
                #if (pd.isna(df3['value']).sum() > 0):
                if pct_null > pct_null_thresh:
                    # Not suitable
                    inc_dict['station_id'].append(stn_id)
                    inc_dict['variable'].append(par)
                    inc_dict['include'].append('no') 

                else:
                    # Include
                    inc_dict['station_id'].append(stn_id)
                    inc_dict['variable'].append(par)
                    inc_dict['include'].append('yes')

# Build df
inc_df = pd.DataFrame(inc_dict)

print('The number of stations with at least some time series meeting the '
      'specified criteria is', len(inc_df.query('include == "yes"')['station_id'].unique()))

inc_df.head()

231 out of 555 stations have at least one time series (i.e. one parameter) that meets the specified requirements.


In [ ]:
# Map of selected stations
sel_stn_list = list(inc_df.query('include == "yes"')['station_id'].unique())
sel_stns = stn_df.query('station_id in @sel_stn_list')
nivapy.spatial.quickmap(sel_stns,
                        cluster=True,
                        popup='station_code')

In [ ]:
# Save station list
sel_stns.to_csv(r'../../../Thematic_Trends_Report_2019/results/selected_stations.csv',
                index=False,
                encoding='utf-8')

2. M-K and Sen's slope for selected sites

Updated 08/11/2019.

The original code here considered medians and minima. However, in an e-mail received 08.11.2019 at 14.13, Øyvind asked for comparisons of medians and maxima, just for the 231 stations. This notebook has been updated to reflect this, but note that notebook 6b (for the "relaxed criteria") still compares medians and minima.

The code below produces a "grid plot" showing the Sen's slope trend estimate for each data series at each site (only data series meeting the selection criteria are included). Each row corresponds to a parameter, and the first column is based on annual medians; the second column uses annual maxima.

M-K and Sen's slope results for all series are also saved to a CSV file for later use.
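
For reference, the Sen's slope used throughout is the Theil-Sen estimator: the median of all pairwise slopes between years. The analysis itself uses nivapy.stats.sens_slope, and the fitted line in each panel is simply year * sen_slp + sen_incpt; the sketch below just illustrates the definition (assuming distinct years) and is not used in the workflow.

In [ ]:
# Illustrative sketch of the Sen's slope definition (nivapy.stats.sens_slope is used below)
def sens_slope_sketch(years, values):
    """Median of all pairwise slopes between distinct years (Theil-Sen estimator)."""
    slopes = [(values[j] - values[i]) / (years[j] - years[i])
              for i in range(len(years)) for j in range(i + 1, len(years))]
    return np.median(slopes)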


In [ ]:
%%capture
# Output folder
out_fold = r'../../../Thematic_Trends_Report_2019/results'
    
# Dicts for results
res_dict = {'station_id':[],
            'variable':[],
            'metric':[],
            'mk_p_val':[],
            'mk_trend':[],
            'sen_slp':[],
            'sen_incpt':[],
            'sen_trend':[],
           }

series_dict = {}

for stn_id in inc_df['station_id'].unique():
    # Determine whether to process this site
    inc_site_df = inc_df.query("(station_id == @stn_id) and (include == 'yes')")
    
    if len(inc_site_df) > 0:    
        # Setup plot
        fig, axes = plt.subplots(nrows=7, ncols=2, figsize=(15, 20))

        # Loop over variables
        for row_idx, par in enumerate(inc_df['variable'].unique()):
            # Determine whether to plot series
            inc = inc_df.query("(station_id == @stn_id) and (variable == @par)")['include'].values[0]

            if inc == 'no':
                # Plot "omitted" text
                axes[row_idx, 0].text(0.5, 0.5, 
                                      'Omitted due to lack of data',
                                      verticalalignment='center', 
                                      horizontalalignment='center',
                                      transform=axes[row_idx, 0].transAxes,
                                      fontsize=18)

                axes[row_idx, 1].text(0.5, 0.5, 
                                      'Omitted due to lack of data',
                                      verticalalignment='center', 
                                      horizontalalignment='center',
                                      transform=axes[row_idx, 1].transAxes,
                                      fontsize=18)
                
            else:
                # Get data
                df2 = df.query("(station_id == @stn_id) and (variable == @par)")
                df2.set_index('sample_date', inplace=True)
                del df2['station_id'], df2['variable']
                series_dict[(stn_id, par)] = df2                
                
                # Resample to annual medians and maxes
                for col_idx, stat in enumerate(['annual median', 'annual maximum']):
                    if stat == 'annual median':
                        df_stat = df2.resample('A').median()
                    else:
                        df_stat = df2.resample('A').max()
                
                    df_stat.index = df_stat.index.year
                    
                    # MK test
                    mk_df = nivapy.stats.mk_test(df_stat, 'value')
                
                    # Sen's slope
                    res_df, sen_df = nivapy.stats.sens_slope(df_stat, 
                                                             value_col='value',
                                                             index_col=df_stat.index)
                
                    # Add results to dict
                    res_dict['station_id'].append(stn_id)
                    res_dict['variable'].append(par)
                    res_dict['metric'].append(stat)
                    res_dict['mk_p_val'].append(mk_df.loc['p'].value)
                    res_dict['mk_trend'].append(mk_df.loc['trend'].value)
                    
                    sslp = res_df.loc['sslp'].value
                    sincpt = res_df.loc['icpt'].value
                    res_dict['sen_slp'].append(sslp)
                    res_dict['sen_incpt'].append(sincpt)
                    res_dict['sen_trend'].append(res_df.loc['trend'].value)
                                        
                    # Plot
                    axes[row_idx, col_idx].plot(sen_df.index, sen_df['value'].values, 'bo-')
                    axes[row_idx, col_idx].plot(sen_df.index, 
                                                sen_df.index*sslp + sincpt, 'k-')                
                
            if par == 'TOC':
                unit = 'mg-C/l'
            else:
                unit = 'µEq/l'
                
            axes[row_idx, 0].set_title('Annual median %s (%s)' % (par, unit))
            axes[row_idx, 1].set_title('Annual maximum %s (%s)' % (par, unit))
        
        plt.tight_layout()
        
        # Save
        out_png = os.path.join(out_fold, 'trends_plots_1990-2016/%s_trends_1990-2016.png' % stn_id)
        plt.savefig(out_png, dpi=200)
        plt.close()        

# Combine results
res_df = pd.DataFrame(res_dict)
out_csv = os.path.join(out_fold, 'trends_summary_1990-2016.csv')
res_df.to_csv(out_csv, index=False, encoding='utf-8')

# Save series
out_pkl = os.path.join(out_fold, 'series.pkl')
with open(out_pkl, 'wb') as handle:
    pickle.dump(series_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [ ]:
res_df.head()

3. Comparing trends in medians and maxima

Updated 08/11/2019.

The original code here considered medians and minima. However, in an e-mail received 08.11.2019 at 14.13, Øyvind asked for comparisons of medians and maxima, just for the 231 stations. This notebook has been updated to reflect this, but note that notebook 6b (for the "relaxed criteria") still compares medians and minima.

Øyvind would like to know whether the trends in medians and maxima are consistent. One way to do this is to plot the Sen's slopes derived from medians against those based on maxima, split by parameter. Øyvind is primarily interested in EH, ANC, alkalinity, so I'll just focus on those here.


In [ ]:
# Cols of interest
cols = ['station_id', 'variable', 'metric', 'sen_slp']

# Setup plot
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10,15))

# Loop over pars
for idx, par in enumerate(['EH', 'ANC', 'ALK-E']):
    # Get data
    cols = ['station_id', 'variable', 'metric', 'sen_slp']
    comp_df = res_df.query("variable == @par")[cols].copy()
    del comp_df['variable']
    comp_df.set_index(['station_id', 'metric'], inplace=True)
    
    # Convert to 'wide' format
    comp_df = comp_df.unstack('metric')
    comp_df.columns = comp_df.columns.get_level_values('metric')

    # Plot
    axes[idx].plot(comp_df['annual median'], comp_df['annual maximum'], 'bo', label="Sen's slopes")
    axes[idx].plot(comp_df['annual median'], comp_df['annual median'], 'k-', label='1:1 line')
    axes[idx].set_xlabel('Slope for annual medians ($yr^{-1}$)', fontsize=14)
    axes[idx].set_ylabel('Slope for annual maxima ($yr^{-1}$)', fontsize=14)
    axes[idx].legend(loc='best', fontsize=14)
    axes[idx].set_title('%s (µEq/l)' % par)

plt.tight_layout()

# Save
out_png = os.path.join(out_fold, 'trend_maxima_vs_medians.png')
plt.savefig(out_png, dpi=200)

Another visualisation option is to create heatmaps showing whether trends in annual maxima are classified as "significant" in the same way as trends in annual medians. The labels on the plots below show the percentage of the total sites for each variable in each of the 9 classes. High proportions along the diagonal indicate that results based on maxima and medians give essentially the same overall picture. The overall direction of the trends is also very clear.


In [ ]:
# Cols of interest
cols = ['station_id', 'variable', 'metric', 'sen_trend']

# Setup plot
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20,5))

# Loop over pars
for idx, par in enumerate(['EH', 'ANC', 'ALK-E']):
    # Get data
    cols = ['station_id', 'variable', 'metric', 'sen_trend']
    comp_df = res_df.query("variable == @par")[cols].copy()
    del comp_df['variable']
    comp_df.set_index(['station_id', 'metric'], inplace=True)
    
    # Convert to 'wide' format
    comp_df = comp_df.unstack('metric')
    comp_df.columns = comp_df.columns.get_level_values('metric')
    comp_df.columns = [i.replace(' ', '_') for i in comp_df.columns]
    
    # Get counts for combos
    opt_list = ['decreasing', 'no trend', 'increasing']
    hmap = np.zeros(shape=(3,3))
    
    # Map options to array indices (medians on x; maxima on y)
    med_map = {'decreasing':0,
               'no trend':1,
               'increasing':2,
              }
    
    max_map = {'decreasing':2,
               'no trend':1,
               'increasing':0,
              }
    
    # Assign counts to array
    for pair in itertools.product(opt_list, repeat=2):
        ann_med, ann_max = pair
        cnt = len(comp_df.query("(annual_median == @ann_med) and (annual_maximum == @ann_max)"))
        pct = 100*cnt/len(comp_df)
        hmap[max_map[ann_max], med_map[ann_med]] = pct

    # Plot
    g = sn.heatmap(hmap, 
                   ax=axes[idx], 
                   square=True, 
                   annot=True,
                   xticklabels=opt_list,
                   yticklabels=opt_list[::-1],
                   cmap='coolwarm',
                  )
    g.set_yticklabels(g.get_yticklabels(), rotation=0)
    
    axes[idx].set_xlabel('Annual medians', fontsize=16)
    axes[idx].set_ylabel('Annual maxima', fontsize=16)
    axes[idx].set_title('%s (%%)' % par)

plt.tight_layout()

# Save
out_png = os.path.join(out_fold, 'trend_maxima_vs_medians_heatmap.png')
plt.savefig(out_png, dpi=200)

4. Summarising by region

To present results concisely, we need to aggregate output from the analysis above. A simple way to get an overall picture of the dataset is to create box plots, violin plots and KDE-smoothed distributions illustrating the range of significant Sen's slopes for each parameter in each region (or country or continent). I will focus on annual medians here since, based on the output above, results for maxima should be similar.


In [ ]:
# Get data
reg_df = res_df.query("(metric == 'annual median') and "
                      "(sen_trend in ('increasing', 'decreasing'))")

# Join regions
reg_df = pd.merge(reg_df, stn_df[['station_id', 'continent', 'country', 'region']],
                  how='left', on='station_id')

In [ ]:
# Boxplots by region
g = sn.catplot(data=reg_df,
               row='variable',
               x='region',
               y='sen_slp',
               kind='box',
               order=['NoNord', 'SoNord', 'Baltic', 'UK-IE-NL', 
                      'WCE', 'ECE', 'Alps', 'AtlCan', 'QuMaVt',
                      'AdsCsk', 'Apps', 'BRM', 'Ont'],
               sharex=False,
               sharey=False,
               aspect=3
              )

# Save
out_png = os.path.join(out_fold, 'slope_box_plots_by_region.png')
plt.savefig(out_png, dpi=200)

In [ ]:
# Violin plots by continent
g = sn.catplot(data=reg_df,
               col='variable',
               col_wrap=4,
               x='continent',
               y='sen_slp',
               kind='violin',
               sharex=False,
               sharey=False,
               aspect=1
              )

g.map(plt.axhline, y=0, lw=2, ls='--', c='k')

# Save
out_png = os.path.join(out_fold, 'slope_violin_plots_by_continent.png')
plt.savefig(out_png, dpi=200)

In [ ]:
# KDE-smoothed plots by continent
g = sn.FacetGrid(reg_df,
                 row="variable",
                 col='continent',
                 aspect=2,
                 sharex='row',
                 sharey=False)
g.map(sn.distplot, 'sen_slp', hist=False, rug=True)
g.map(plt.axvline, x=0, lw=2, ls='--', c='k')

# Save
out_png = os.path.join(out_fold, 'slope_kde_plots_by_continent.png')
plt.savefig(out_png, dpi=200)

Remember that all the plots above are different ways of illustrating the distributions of significant Sen's slopes. I think there are some interesting patterns here - especially comparing Europe with North America.

5. Change points

The code below uses the BCP R package and is modified from notebook 5.

For each of the selected time series at each site, I will run a change point analysis (without regression - see here for details) and produce a single plot showing all the data for each site. For each variable, I will also record years when the probability of change is (i) greater than 50% and (ii) greater than 75%. This will make it possible to look for consistent change points across regions.
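
To make the core of the analysis explicit, here is a minimal sketch of the underlying BCP call for a single annual series. It is illustrative only (the plotting function actually used follows below); the helper name bcp_change_years and its arguments are hypothetical, and it assumes pandas2ri has been activated (as in the setup cell) and that missing years have already been interpolated, since bcp cannot handle NaNs.

In [ ]:
# Illustrative sketch only - the full workflow uses plot_bcp_change_pts() defined below
def bcp_change_years(ann_ser, p0=0.1, prob_thresh=0.5):
    """Return the years where the posterior probability of a change point exceeds
    `prob_thresh`. `ann_ser` is an annual series of values indexed by year."""
    # Assume a constant mean within each partition (i.e. no regression)
    res = bcp.bcp(ann_ser, p0=p0)
    probs = np.asarray(res.rx2('posterior.prob'))

    return [yr for yr, p in zip(ann_ser.index, probs) if p > prob_thresh]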


In [ ]:
def listvector_to_dict(lv):
    """ Convert R ListVector to a Python dict.
    """
    return dict(zip(lv.names, map(list,list(lv))))

def plot_bcp_change_pts(stn_id, reg=True, p0=0.2):
    """ Run BCP analysis to detect change-points.
    """
    res_dict = {'station_id':[],
                'variable':[],
                'prob':[],
                'year':[],
               }
    
    # Setup plot
    fig, axes = plt.subplots(nrows=7, ncols=2, 
                             sharex=False,
                             sharey=False,
                             figsize=(15,15))
    
    # Loop over variables
    for row_idx, par in enumerate(inc_df['variable'].unique()):
        # Determine whether to plot series
        inc = inc_df.query("(station_id == @stn_id) and (variable == @par)")['include'].values[0]

        if inc == 'no':
            # Plot "omitted" text
            axes[row_idx, 0].text(0.5, 0.5, 
                                  'Omitted due to lack of data',
                                  verticalalignment='center', 
                                  horizontalalignment='center',
                                  transform=axes[row_idx, 0].transAxes,
                                  fontsize=18)
            axes[row_idx, 0].set_ylabel(par)

            axes[row_idx, 1].text(0.5, 0.5, 
                                  'Omitted due to lack of data',
                                  verticalalignment='center', 
                                  horizontalalignment='center',
                                  transform=axes[row_idx, 1].transAxes,
                                  fontsize=18)
            axes[row_idx, 1].set_ylabel('Probability')
            
        else:
            # Get data
            df2 = df.query("(station_id == @stn_id) and (variable == @par)")
            df2.set_index('sample_date', inplace=True)
            del df2['station_id'], df2['variable']
    
            # Resample
            df2 = df2.resample('A').median().reset_index()
            df2['year'] = df2['sample_date'].dt.year
            del df2['sample_date']

            # Interpolate years with missing data
            index_df = pd.DataFrame({'year':range(1990, 2017)})
            df2 = pd.merge(index_df, df2, how='left', on='year')
            df2.interpolate(method='linear', inplace=True)
            df2.fillna(method='backfill', inplace=True)

            # Change point analysis
            if reg:
                # Perform regression within each partition
                res = bcp.bcp(df2['value'], 
                              x=df2['year'], 
                              p0=p0)
            else:
                # Assume constant mean within each partition
                res = bcp.bcp(df2['value'], 
                              p0=p0)            

            res = pd.DataFrame({'raw':df2['value'].values,
                                'mean':res.rx2('posterior.mean').flatten(),
                                'prob':res.rx2('posterior.prob')},
                               index=df2['year'].values)

            # Add to results
            gt50 = np.array(res.index[res['prob'] > 0.5])
            for year in gt50:                        
                res_dict['station_id'].append(stn_id)
                res_dict['variable'].append(par)
                res_dict['prob'].append('gt50')
                res_dict['year'].append(year)
            
            gt75 = np.array(res.index[res['prob'] > 0.75])
            for year in gt75:                        
                res_dict['station_id'].append(stn_id)
                res_dict['variable'].append(par)
                res_dict['prob'].append('gt75')
                res_dict['year'].append(year)  
                
            # Plot
            res['raw'].plot(ax=axes[row_idx, 0], marker='o', linestyle=':', label='Raw')
            res['mean'].plot(ax=axes[row_idx, 0], label='Fitted')
            axes[row_idx, 0].set_ylabel(par)
            axes[row_idx, 0].legend(loc='best')

            res['prob'].plot(ax=axes[row_idx, 1], marker='o')
            axes[row_idx, 1].set_ylabel('Probability')
            axes[row_idx, 1].set_ylim((0, 1))
            axes[row_idx, 1].axhline(0.5, c='k', linestyle='--')
            axes[row_idx, 1].axhline(0.95, c='k', linestyle='--')

        axes[0, 0].set_title('Data series')    
        axes[0, 1].set_title('Change probability')
        
    plt.tight_layout()
    out_png = os.path.join(out_fold, 'bcp_selected_sites/bcp_stn_%s.png' % stn_id)
    plt.savefig(out_png, dpi=200)
    plt.close()
    
    res_df = pd.DataFrame(res_dict)
    
    return res_df

In [ ]:
%%capture
# Container for results
df_list = []

# Loop over stations
for stn_id in inc_df['station_id'].unique():
    # Determine whether to process this site
    inc_site_df = inc_df.query("(station_id == @stn_id) and (include == 'yes')")
    
    if len(inc_site_df) > 0: 
        # Run BCP
        res_bcp = plot_bcp_change_pts(stn_id, reg=False, p0=0.1)  
        df_list.append(res_bcp)
        
# Combine results
res_bcp = pd.concat(df_list, axis=0)

In [ ]:
# Show example BCP output
Image(r'../../../Thematic_Trends_Report_2019/results/bcp_selected_sites/bcp_stn_38242.png')

These results can now be aggregated regionally. I will produce four sets of plots:

  1. KDE-smoothed histograms showing years where the probability of change is >50%

    a. Split by region

    b. Split by continent

  2. KDE-smoothed histograms showing years where the probability of change is >75%

    a. Split by region

    b. Split by continent

For ease of comparison, columns represent parameters and rows are regions. Considering one column at a time therefore makes it possible to identify similar patterns between regions.


In [ ]:
# Join regions
res_bcp = pd.merge(res_bcp, stn_df[['station_id', 'continent', 'country', 'region']],
                   how='left', on='station_id')

res_bcp.head()

In [ ]:
# KDE-smoothed plots by region
# Change prob > 50%
gt50_df = res_bcp.query('prob == "gt50"')

g = sn.FacetGrid(gt50_df,
                 row='region',
                 col='variable',
                 aspect=1.5,
                 sharex='row',
                 sharey=False,
                 row_order=['NoNord', 'SoNord', 'UK-IE-NL', 
                            'WCE', 'ECE', 'Alps', 'QuMaVt',
                            'AdsCsk', 'BRM', 'Ont'])
g.map(sn.distplot, 'year')
g.set(xlim=(1990, 2016))

# Save
out_png = os.path.join(out_fold, 'change_prob_kde_gt50_by_region.png')
plt.savefig(out_png, dpi=200)

In [ ]:
# KDE-smoothed plots by region
# Change prob > 75%
gt50_df = res_bcp.query('prob == "gt75"')

g = sn.FacetGrid(gt50_df,
                 row='region',
                 col='variable',
                 aspect=1.5,
                 sharex='row',
                 sharey=False,
                 row_order=['NoNord', 'SoNord', 'UK-IE-NL', 
                            'WCE', 'ECE', 'Alps', 'QuMaVt',
                            'AdsCsk', 'BRM', 'Ont'])
g.map(sn.distplot, 'year')
g.set(xlim=(1990, 2016))

# Save
out_png = os.path.join(out_fold, 'change_prob_kde_gt75_by_region.png')
plt.savefig(out_png, dpi=200)

In [ ]:
# KDE-smoothed plots by continent
# Change prob > 50%
gt50_df = res_bcp.query('prob == "gt50"')

g = sn.FacetGrid(gt50_df,
                 row='continent',
                 col='variable',
                 aspect=1.5,
                 sharex='row',
                 sharey=False)
g.map(sn.distplot, 'year')
g.set(xlim=(1990, 2016))

# Save
out_png = os.path.join(out_fold, 'change_prob_kde_gt50_by_continent.png')
plt.savefig(out_png, dpi=200)

In [ ]:
# KDE-smoothed plots by continent
# Change prob > 75%
gt50_df = res_bcp.query('prob == "gt75"')

g = sn.FacetGrid(gt50_df,
                 row='continent',
                 col='variable',
                 aspect=1.5,
                 sharex='row',
                 sharey=False)
g.map(sn.distplot, 'year')
g.set(xlim=(1990, 2016))

# Save
out_png = os.path.join(out_fold, 'change_prob_kde_gt75_by_continent.png')
plt.savefig(out_png, dpi=200)

I haven't had time to look at this output in detail, but there appears to be evidence for consistent regional changes in at least some variables. For example, ENO3X, ECaX_EMgX and ESO4X all show similar patterns in the SoNord, NoNord and UK+IE+NL regions (and perhaps for Europe as a whole too), with many change points occurring in the mid-1990s. For TOC, meanwhile, the most common European change points are about a decade later, in the mid-2000s.

6. Differences between medians and minima

Added 02/06/2019.

Øyvind would like to explore trends in the absolute differences between annual median and minimum values for ANC and pH. This is motivated by some interesting patterns previously documented at Øygardsbekken (station ID 38313) - see e-mail from Øyvind received 01/06/2019 at 17.47 for details.

As a sanity check, it's a good idea to first make sure I can reproduce the patterns in Øyvind's e-mail.
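
Note that hydrogen ion concentration is stored as EH in µEq/l, so the conversion to pH used in the cells below is pH = -log10(EH / 1e6). A quick worked example (the value is chosen purely for illustration):

In [ ]:
# EH -> pH conversion used below: EH is H+ in µEq/l, so divide by 1e6 to get mol/l
eh = 10.0                   # µEq/l, i.e. 1e-5 mol/l of H+
print(-np.log10(eh / 1e6))  # 5.0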


In [ ]:
# Load series
pkl = os.path.join(out_fold, 'series.pkl')
with open(pkl, 'rb') as handle:
    series_dict = pickle.load(handle)
    
# Get data for Øygardsbekken
stn_id = 38313

# Loop over dfs
df_list = []

for par in ['ANC', 'EH']:
    # Get data
    df = series_dict[(stn_id, par)]   
    
    # Convert EH to pH
    if par == 'EH':
        df['value'] = -np.log10(df['value']/1E6) 
        par = 'pH'
        
    # Resample to annual medians and mins
    for col_idx, stat in enumerate(['annual median', 'annual minimum']):
        if stat == 'annual median':
            df_stat = df.resample('A').median()
        else:
            df_stat = df.resample('A').min()

        df_stat['year'] = df_stat.index.year
        df_stat['metric'] = stat
        df_stat['variable'] = par
        df_stat.reset_index(inplace=True, drop=True)
            
        df_list.append(df_stat)
        
ann_df = pd.concat(df_list, axis=0)        
ann_df.head()

In [ ]:
# Plot
g = sn.lmplot(x='year', 
              y='value', 
              data=ann_df,
              col='variable',
              hue='metric',
              sharey=False,
              lowess=True)

g.axes[0, 0].set_ylim((-125, 75))
g.axes[0, 1].set_ylim((4, 6))

These plots look the same as in Øyvind's e-mail, which is good. The difference between minimum and median ANC looks fairly steady, whereas differences for pH are becoming larger through time, apparently because median pH is increasing more rapidly than minimum pH. The next step is to calculate absolute differences between median and minimum values for all sites and then test for trends.


In [ ]:
%%capture
# Load series
pkl = os.path.join(out_fold, 'series.pkl')
with open(pkl, 'rb') as handle:
    series_dict = pickle.load(handle)
    
# Dicts for results
res_dict = {'station_id':[],
            'variable':[],
            'mk_p_val':[],
            'mk_trend':[],
            'sen_slp':[],
            'sen_incpt':[],
            'sen_trend':[],
           }

for stn_id in inc_df['station_id'].unique():
    for par in ['ANC', 'EH']:
        try:
            # Get data
            df = series_dict[(stn_id, par)]   

            # Convert EH to pH
            if par == 'EH':
                df['value'] = -np.log10(df['value']/1E6) 
                par = 'pH'

            # Resample to annual medians and mins
            df_med = df.resample('A').median()
            df_min = df.resample('A').min()
            df_diff = df_med - df_min
            df_diff.index = df_diff.index.year

            # MK test
            mk_df = nivapy.stats.mk_test(df_diff, 'value')

            # Sen's slope
            res_df, sen_df = nivapy.stats.sens_slope(df_diff, 
                                                     value_col='value',
                                                     index_col=df_diff.index)

            # Add results to dict
            res_dict['station_id'].append(stn_id)
            res_dict['variable'].append(par)
            res_dict['mk_p_val'].append(mk_df.loc['p'].value)
            res_dict['mk_trend'].append(mk_df.loc['trend'].value)

            sslp = res_df.loc['sslp'].value
            sincpt = res_df.loc['icpt'].value
            res_dict['sen_slp'].append(sslp)
            res_dict['sen_incpt'].append(sincpt)
            res_dict['sen_trend'].append(res_df.loc['trend'].value)
                                       
        except KeyError:
            pass

# Combine results
res_df = pd.DataFrame(res_dict)
out_csv = os.path.join(out_fold, 'trends_in_differences.csv')
res_df.to_csv(out_csv, index=False, encoding='utf-8')

In [ ]:
res_df.head()

As a further quick check, here are the results for Øygardsbekken:


In [ ]:
# Get results for Øygardsbekken
res_df.query('station_id == 38313')

As expected, there is no trend for differences in ANC, but a significant increasing trend for differences in pH.

Next, join in the region data and explore patterns for all sites.


In [ ]:
# Just significant
df = res_df.query("sen_trend in ('increasing', 'decreasing')")

# Join regions
df = pd.merge(df, stn_df[['station_id', 'continent', 'country', 'region']],
              how='left', on='station_id')

In [ ]:
# Box plots of significant slopes in differences by region
g = sn.catplot(data=df,
               x='region',
               y='sen_slp',
               row='variable',
               kind='box', 
               order=['NoNord', 'SoNord', 'Baltic', 'UK-IE-NL', 
                      'WCE', 'ECE', 'Alps', 'AtlCan', 'QuMaVt',
                      'AdsCsk', 'Apps', 'BRM', 'Ont'],
               sharex=False,
               sharey=False,
               aspect=3,
              ) 

g.map(plt.axhline, y=0, lw=2, ls='--', c='k', alpha=0.4)

# Save
out_png = os.path.join(out_fold, 'med_min_diff_box_plots_by_region.png')
plt.savefig(out_png, dpi=200)

In [ ]:
# KDE-smoothed plots by continent
g = sn.FacetGrid(df,
                 row='variable',
                 col='continent',
                 aspect=2,
                 sharex='row',
                 sharey=False)
g.map(sn.distplot, 'sen_slp', hist=False, rug=True)
g.map(plt.axvline, x=0, lw=2, ls='--', c='k')

# Save
out_png = os.path.join(out_fold, 'med_min_diff_kde_plots_by_continent.png')
plt.savefig(out_png, dpi=200)

Results vary considerably between regions and, from a quick glance, I don't think there's a clear picture here. The increasing trend in pH differences at Øygardsbekken seems typical of some (but not all) sites in the Southern Nordic region, and also of sites in the UK, Ireland and the Netherlands. All these regions likely have fairly high marine inputs, which perhaps supports Øyvind's hypothesis that these patterns are driven by sea salt episodes.