In [1]:
%matplotlib inline
import pandas as pd, imp
from sqlalchemy import create_engine

Update TOC trends analysis

Tore has previously written code to calculate Mann-Kendall (M-K) trend statistics and Sen's slope estimates for data series in RESA2. According to my notes from a meeting with Tore on 13/05/2016, the workflow goes something like this:

  1. Run code to extract and summarise time series from RESA2, insert this data into Mann-Kendall_Sen.xls, then read the results back into a new table in RESA2 called e.g. ICPW_STATISTICS.

  2. Run the ICPStat query in Icp-waters2001_2000.accdb to summarise the data in ICPW_STATISTICS. This creates a new table currently called aaa, but Tore says he'll rename it to something more descriptive before he leaves.

  3. Run the export() subroutine in the Export module of Icp-waters2001_2000.accdb to reformat the aaa table and write the results to an Excel file.

Mann-Kendall_Sen.xls is an early version of the popular Excel macro MULTIMK/CONDMK, which Tore has modified slightly for use in this analysis. (A more recent version of the same file is available here). This Excel macro permits some quite sophisticated multivariate and conditional analyses, but as far as I can tell the TOC trends code is only making use of the most basic functionality - performing repeated independent trend tests on annually summarised time series.

Unfortunately, although the workflow above makes sense, I've so far failed to find and run Tore's code for step 1 (I can find everything for steps 2 and 3, but not the code for interacting with the Excel workbook). It also seems a bit messy to be switching back and forth between RESA2, Excel and Access in this way, so the code here is a first step towards refactoring the whole analysis into Python.

1. Test data

The Mann-Kendall_Sen.xls file on the network already had some example ICPW data in it, which I can use to test my code. The raw input data and the results obtained from the Excel macro are saved as mk_sen_test_data.xlsx.


In [2]:
# Read data and results from the Excel macro
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Data\mk_sen_test_data.xlsx')

raw_df = pd.read_excel(in_xlsx, sheetname='input')
res_df = pd.read_excel(in_xlsx, sheetname='results')

raw_df


Out[2]:
STATION_ID YEAR ESO4 ESO4X ECL ESO4CL TOC_DOC ECAEMG ECAXEMGX ENO3 AL ANC ALK HPLUSS
0 196 2002 65.37472 37.48427 270.78103 336.15575 1.1 90.04695 26.95497 25.35714 NaN -29.62386 0.0000 6.45654
1 196 2003 67.66492 41.80815 251.03658 318.70150 1.2 91.65460 33.16308 24.28571 NaN -16.01517 0.0000 7.58578
2 196 2004 58.29593 37.40714 202.80371 261.09964 1.5 76.21036 28.95710 17.85714 116.04925 -10.20958 0.0000 5.49541
3 196 2005 63.29272 31.04439 313.09057 376.38329 1.5 102.44073 29.49062 17.85714 NaN -8.89069 0.0000 8.91251
4 196 2006 58.50413 33.60609 241.72849 300.23262 1.6 92.07006 35.74731 16.78571 NaN -0.40248 0.0000 7.24436
5 196 2007 57.25493 26.74975 296.16676 353.42169 1.9 97.15416 28.14731 19.64286 NaN -0.34566 5.2920 3.38844
6 196 2008 57.46313 26.95795 296.16676 353.62989 1.3 96.15616 27.14931 15.00000 NaN -4.50689 0.0000 3.89045
7 196 2009 54.34013 21.51075 318.73184 373.07197 1.5 93.01387 23.63606 12.14286 NaN -2.30612 0.0000 4.36516
8 196 2010 55.58933 31.47571 234.11277 289.70210 2.0 83.64120 29.09293 15.35714 NaN 15.23056 5.2920 2.13796
9 196 2011 46.63674 25.68985 203.36784 250.00458 2.3 91.07205 43.68734 12.14286 NaN 12.07122 6.4445 2.04174
10 196 2012 53.71553 23.79141 290.52548 344.24101 1.9 101.94173 34.24928 15.71429 NaN 14.55875 0.0000 2.29087

In [3]:
res_df


Out[3]:
par non_missing test_stat std_dev mk_stat mk_p_val sen_slp
0 ESO4 11 -43 12.845233 -3.347545 0.000815 -1.457400
1 ESO4X 11 -37 12.845233 -2.880446 0.003971 -1.701966
2 ECL 11 0 12.806248 0.000000 1.000000 0.000000
3 ESO4CL 11 -3 12.845233 -0.233550 0.815335 -1.585009
4 TOC_DOC 11 35 12.662280 2.764115 0.005708 0.100000
5 ECAEMG 11 9 12.845233 0.700649 0.483522 0.423846
6 ECAXEMGX 11 9 12.845233 0.700649 0.483522 0.267245
7 ENO3 11 -35 12.767145 -2.741412 0.006118 -0.964285
8 AL 1 0 0.000000 NaN NaN NaN
9 ANC 11 43 12.845233 3.347545 0.000815 3.562240
10 ALK 11 16 9.933109 1.610775 0.107229 0.000000
11 HPLUSS 11 -33 12.845233 -2.569047 0.010198 -0.539823

2. Statistical functions

Looking at the output in the ICPW_STATISTICS3 table of RESA2, we need to calculate the following statistics (only some of which are output by the Excel macro):

  • Number of non-missing values
  • Median
  • Mean
  • Period over which data are available (start and end years)
  • Standard deviation (of the data)
  • Standard deviation (expected under the null hypothesis of the M-K test)
  • M-K statistic
  • Normalised M-K statistic $\left(= \frac{\text{M-K statistic}}{\text{standard deviation}} \right)$
  • M-K p-value
  • Sen's slope (a.k.a. the Theil-Sen slope)

Most of these should be quite straightforward. We'll start off by defining a function to calculate the M-K statistic (note that Scipy already has a function for the Theil-Sen slope). We'll also define another function to bundle everything together and return a dataframe of the results.
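For reference, the quantities computed by the code below are the standard ones for the M-K test: the test statistic, its variance under the null hypothesis (with a correction for ties) and the normalised statistic:

$$S = \sum_{k=1}^{n-1} \sum_{j=k+1}^{n} sign(x_j - x_k)$$

$$Var(S) = \frac{n(n-1)(2n+5) - \sum_{i=1}^{g} t_i(t_i-1)(2t_i+5)}{18}$$

$$Z = \begin{cases} \frac{S-1}{\sqrt{Var(S)}} & S > 0 \\ 0 & S = 0 \\ \frac{S+1}{\sqrt{Var(S)}} & S < 0 \end{cases}$$

where $n$ is the number of values, $g$ is the number of tied groups and $t_i$ is the number of values in the $i$-th tied group. The Sen's slope is simply the median of the pairwise slopes $(x_j - x_k)/(t_j - t_k)$ over all pairs with $t_j > t_k$, where $t$ is the year.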


In [4]:
def mk_test(x, stn_id, par, alpha=0.05):
    """ Adapted from http://pydoc.net/Python/ambhas/0.4.0/ambhas.stats/
        by Sat Kumar Tomer.
    
        Perform the MK test for monotonic trends. Uses the "normal
        approximation" to determine significance and therefore should 
        only be used if the number of values is >= 10.
    
    Args:
        x:      1D array of data
        stn_id: Station ID (used in the warning message for short series)
        par:    Parameter name (also used in the warning message)
        alpha:  Significance level
    
    Returns:
        var_s: Variance of test statistic
        s:     M-K test statistic
        z:     Normalised test statistic 
        p:     p-value of the significance test
        trend: Whether to reject the null hypothesis (no trend) at
               the specified significance level. One of: 
               'increasing', 'decreasing' or 'no trend'
    """
    import numpy as np
    from scipy.stats import norm
    
    n = len(x)
    
    if n < 10:
        print ('    Data series for %s at site %s has fewer than 10 non-null values. '
               'Significance estimates may be unreliable.' % (par, int(stn_id)))
    
    # calculate S 
    s = 0
    for k in xrange(n-1):
        for j in xrange(k+1,n):
            s += np.sign(x[j] - x[k])
    
    # calculate the unique data
    unique_x = np.unique(x)
    g = len(unique_x)
    
    # calculate the var(s)
    if n == g: # there is no tie
        var_s = (n*(n-1)*(2*n+5))/18.           
    else: # there are some ties in data
        tp = np.zeros(unique_x.shape)
        for i in xrange(len(unique_x)):
            tp[i] = sum(unique_x[i] == x)
        # Sat Kumar's code has "+ np.sum", which is incorrect
        var_s = (n*(n-1)*(2*n+5) - np.sum(tp*(tp-1)*(2*tp+5)))/18.
    
    if s > 0:
        z = (s - 1)/np.sqrt(var_s)
    elif s == 0:
        z = 0
    elif s < 0:
        z = (s + 1)/np.sqrt(var_s)
    else:
        z = np.nan
        
    # calculate the p_value
    p = 2*(1-norm.cdf(abs(z))) # two tail test
    h = abs(z) > norm.ppf(1-alpha/2.) 

    if (z<0) and h:
        trend = 'decreasing'
    elif (z>0) and h:
        trend = 'increasing'
    elif np.isnan(z):
        trend = np.nan
    else:
        trend = 'no trend'
        
    return var_s, s, z, p, trend

def wc_stats(raw_df, st_yr=None, end_yr=None):
    """ Calculate key statistics for the TOC trends analysis:
        
            'station_id'
            'par_id'
            'non_missing'
            'median'
            'mean'
            'std_dev'
            'period'
            'mk_std_dev'
            'mk_stat'
            'norm_mk_stat'
            'mk_p_val'
            'trend'
            'sen_slp'
    
    Args:
        raw_df:   Dataframe with annual data for a single station. Columns must 
                  be: [station_id, year, par1, par2, ... parn]
        st_yr:    First year to include in analysis. Pass None to start
                  at the beginning of the series
        end_yr:   Last year to include in analysis. Pass None to continue
                  to the end of the series
    
    Returns:
        df of key statistics.
    """
    import numpy as np, pandas as pd
    from scipy.stats import theilslopes
    
    # Checking
    df = raw_df.copy()
    assert list(df.columns[:2]) == ['STATION_ID', 'YEAR'], 'Columns must be: [STATION_ID, YEAR, par1, par2, ... parn]'
    assert len(df['STATION_ID'].unique()) == 1, 'You can only process data for one site at a time'
    
    # Get just the period of interest
    if st_yr:
        df = df.query('YEAR >= @st_yr')
    if end_yr:
        df = df.query('YEAR <= @end_yr')
    
    # Get stn_id
    stn_id = df['STATION_ID'].iloc[0]
    
    # Tidy up df
    df.index = df['YEAR']
    df.sort_index(inplace=True)
    del df['STATION_ID'], df['YEAR']
    
    # Container for results
    data_dict = {'station_id':[],
                 'par_id':[],
                 'non_missing':[],
                 'median':[],
                 'mean':[],
                 'std_dev':[],
                 'period':[],
                 'mk_std_dev':[],
                 'mk_stat':[],
                 'norm_mk_stat':[],
                 'mk_p_val':[],
                 'trend':[],
                 'sen_slp':[]}
    
    # Loop over pars
    for col in df.columns:
        # 1. Station ID
        data_dict['station_id'].append(stn_id)
        
        # 2. Par ID
        data_dict['par_id'].append(col)
        
        # 3. Non-missing
        data_dict['non_missing'].append(pd.notnull(df[col]).sum())
        
        # 4. Median
        data_dict['median'].append(df[col].median())
        
        # 5. Mean
        data_dict['mean'].append(df[col].mean())
        
        # 6. Std dev
        data_dict['std_dev'].append(df[col].std())
        
        # 7. Period
        st_yr = df.index.min()
        end_yr = df.index.max()
        per = '%s-%s' % (st_yr, end_yr)
        data_dict['period'].append(per)

        # 8. M-K test
        # Drop missing values
        mk_df = df[[col]].dropna(how='any')
        
        # Only run stats if more than 1 valid value
        if len(mk_df) > 1:
            var_s, s, z, p, trend = mk_test(mk_df[col].values, stn_id, col)
            data_dict['mk_std_dev'].append(np.sqrt(var_s)) 
            data_dict['mk_stat'].append(s)
            data_dict['norm_mk_stat'].append(z)
            data_dict['mk_p_val'].append(p)
            data_dict['trend'].append(trend)      

            # 9. Sen's slope
            # First element of output gives median slope. Other results could
            # also be useful - see docs
            sslp = theilslopes(mk_df[col].values, mk_df.index, 0.95)[0]
            data_dict['sen_slp'].append(sslp)
            
        # Otherwise all NaN
        else:
            for par in ['mk_std_dev', 'mk_stat', 'norm_mk_stat', 
                        'mk_p_val', 'trend', 'sen_slp']:
                data_dict[par].append(np.nan)
    
    # Build to df
    res_df = pd.DataFrame(data_dict)
    res_df = res_df[['station_id', 'par_id', 'period', 'non_missing',
                     'mean', 'median', 'std_dev', 'mk_stat', 'norm_mk_stat',
                     'mk_p_val', 'mk_std_dev', 'trend', 'sen_slp']]    
    
    return res_df

3. Perform comparison


In [5]:
# Run analysis on test data and print results
out_df = wc_stats(raw_df)
del out_df['station_id']
out_df


Out[5]:
par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val mk_std_dev trend sen_slp
0 ESO4 2002-2012 11 58.012019 57.46313 5.862422 -43.0 -3.269696 0.001077 12.845233 decreasing -1.457400
1 ESO4X 2002-2012 11 30.684133 31.04439 6.406605 -37.0 -2.802596 0.005069 12.845233 decreasing -1.701966
2 ECL 2002-2012 11 265.319257 270.78103 41.453419 0.0 0.000000 1.000000 12.806248 no trend 0.000000
3 ESO4CL 2002-2012 11 323.331276 336.15575 43.184790 -3.0 -0.155700 0.876270 12.845233 no trend -1.585009
4 TOC_DOC 2002-2012 11 1.618182 1.50000 0.368288 35.0 2.685140 0.007250 12.662280 increasing 0.100000
5 ECAEMG 2002-2012 11 92.309261 92.07006 7.587119 9.0 0.622799 0.533417 12.845233 no trend 0.423846
6 ECAXEMGX 2002-2012 11 30.934119 29.09293 5.498533 9.0 0.622799 0.533417 12.845233 no trend 0.267245
7 ENO3 2002-2012 11 17.467532 16.78571 4.295982 -35.0 -2.663086 0.007743 12.767145 decreasing -0.964285
8 AL 2002-2012 1 116.049250 116.04925 NaN NaN NaN NaN NaN NaN NaN
9 ANC 2002-2012 11 -2.767265 -2.30612 13.596531 43.0 3.269696 0.001077 12.845233 increasing 3.562240
10 ALK 2002-2012 11 1.548045 0.00000 2.667981 16.0 1.510101 0.131018 9.933110 no trend 0.000000
11 HPLUSS 2002-2012 11 4.891747 4.36516 2.403797 -33.0 -2.491197 0.012731 12.845233 decreasing -0.539823

And below is the output from the Excel macro for comparison.


In [6]:
res_df


Out[6]:
par non_missing test_stat std_dev mk_stat mk_p_val sen_slp
0 ESO4 11 -43 12.845233 -3.347545 0.000815 -1.457400
1 ESO4X 11 -37 12.845233 -2.880446 0.003971 -1.701966
2 ECL 11 0 12.806248 0.000000 1.000000 0.000000
3 ESO4CL 11 -3 12.845233 -0.233550 0.815335 -1.585009
4 TOC_DOC 11 35 12.662280 2.764115 0.005708 0.100000
5 ECAEMG 11 9 12.845233 0.700649 0.483522 0.423846
6 ECAXEMGX 11 9 12.845233 0.700649 0.483522 0.267245
7 ENO3 11 -35 12.767145 -2.741412 0.006118 -0.964285
8 AL 1 0 0.000000 NaN NaN NaN
9 ANC 11 43 12.845233 3.347545 0.000815 3.562240
10 ALK 11 16 9.933109 1.610775 0.107229 0.000000
11 HPLUSS 11 -33 12.845233 -2.569047 0.010198 -0.539823

My code gives near-identical results to those from the Excel macro. The small differences in the normalised statistic and p-value arise because my code applies a continuity correction when calculating $Z$ (i.e. $(S \pm 1)/\sigma$ rather than $S/\sigma$), which the macro apparently does not. There are also a few edge cases that might be worth investigating further. For example, if there are fewer than 10 non-null values, my code currently prints a warning. I'm not sure exactly what the Excel macro does yet, but in general it seems that for fewer than 10 values it's necessary to use a lookup table (see e.g. the Instructions sheet of the file here).
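As a further check, the two sets of results can be compared programmatically rather than by eye. The column names differ between the two outputs (the macro's test_stat corresponds to my mk_stat, its mk_stat to my norm_mk_stat and its std_dev to my mk_std_dev), so a rough sketch of the comparison (not run above) looks like this:

# Rough sketch: line up my output with the Excel macro's results
cmp_df = pd.merge(out_df, res_df, left_on='par_id', right_on='par',
                  suffixes=('_py', '_xl'))

# (my column, Excel column) pairs measuring the same quantity
pairs = [('mk_stat_py', 'test_stat'),      # M-K statistic, S
         ('norm_mk_stat', 'mk_stat_xl'),   # normalised statistic, Z
         ('mk_std_dev', 'std_dev_xl'),     # std. dev. under H0
         ('mk_p_val_py', 'mk_p_val_xl'),   # p-value
         ('sen_slp_py', 'sen_slp_xl')]     # Sen's slope

for py_col, xl_col in pairs:
    max_diff = (cmp_df[py_col] - cmp_df[xl_col]).abs().max()
    print('%s vs %s: max abs diff = %.6f' % (py_col, xl_col, max_diff))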

4. Get data from RESA2

The next step is to read the correct data directly from RESA2 and summarise it to look like raw_df, above. Start off by connecting to the database.
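The connection itself is handled by a helper in useful_resa2_code.py, so the details aren't reproduced here, but it presumably wraps something like the sketch below (the username, password and connection string are placeholders, not the real values):

# Hypothetical sketch of a direct SQLAlchemy connection to the Oracle
# database behind RESA2 (credentials and DSN below are placeholders)
from sqlalchemy import create_engine

conn_str = 'oracle+cx_oracle://my_user:my_password@my_host:1521/my_sid'
engine = create_engine(conn_str)
conn = engine.connect()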


In [7]:
# Use custom RESA2 function to connect to db
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()

Looking at the ICPW_STATISTICS table in RESA2, it seems as though trends have been assessed for 14 parameters and several different time periods for each site of interest. The length and number of time periods vary from site to site, so I'll need to check with Heleen regarding how these variables should be chosen. The 14 parameters are as follows:

  • ESO4
  • ESO4X
  • ECl
  • ESO4Cl
  • TOC_DOC
  • ECaEMg
  • ECaXEMgX
  • ENO3
  • Al
  • ANC
  • ALK
  • HPLUS
  • ESO4EClENO3
  • ENO3DIVENO3ESO4X

Many of these quantities are unfamiliar to me, but presumably the equations for calculating them can be found in Tore's code (which I can't find at present). Check with Heleen whether all of these are still required and find equations as necessary.

The other issue is how to aggregate the values in the database from their original temporal resolution to annual summaries. I assume the median annual value is probably appropriate in most cases, but it would be good to know what Tore did previously.

For now, I'll focus on:

  1. Extracting the data from the database for a specified time period,

  2. Calculating the required water chemistry parameters,

  3. Taking annual medians and

  4. Estimating the trend statistics.

It should then be fairly easy to modify this code later as necessary.

4.1. Equations

Some of the quantities listed above are straightforward to calculate.

4.1.1. Micro-equivalents per litre

The Es in the parameter names are just unit conversions to micro-equivalents per litre:

$$EPAR \; (\mu eq/l) = \frac{10^6 \times valency}{molar \; mass \; (g/mol)} \times PAR \; (g/l)$$

Molar masses and valencies for the key species listed above are given in the table below.


In [8]:
# Tabulate chemical properties
chem_dict = {'molar_mass':[96, 35, 40, 24, 14],
             'valency':[2, 1, 2, 2, 1],
             'resa2_ref_ratio':[0.103, 1., 0.037, 0.196, 'N/A']}

chem_df = pd.DataFrame(chem_dict, index=['SO4', 'Cl', 'Ca', 'Mg', 'NO3-N'])
chem_df = chem_df[['molar_mass', 'valency', 'resa2_ref_ratio']]

chem_df


Out[8]:
molar_mass valency resa2_ref_ratio
SO4 96 2 0.103
Cl 35 1 1
Ca 40 2 0.037
Mg 24 2 0.196
NO3-N 14 1 N/A
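As a quick sanity check of the conversion: the extracted concentrations appear to be in mg/l rather than g/l (except NO3-N, which appears to be in µg/l, which is why it's treated differently in the conversion code in section 4.5), so the $10^6$ in the equation above becomes $10^3$ in practice. For example, taking the 1984 median sulphate value for station 23499 from the tables further down:

# Worked example (assuming concentrations in mg/l): convert the 1984
# median SO4 value for station 23499 to micro-equivalents per litre
so4_mgl = 6.66                        # mg/l (see section 4.4)
eso4 = so4_mgl * 2 * 1000. / 96       # valency 2, molar mass 96 g/mol
print(eso4)                           # 138.75 ueq/l, matching section 4.5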

4.1.2. Sea-salt corrected values

The Xs are sea-salt corrected values (also sometimes denoted with an asterisk, e.g. Ca*). They are calculated by comparison to chloride concentrations, which are generally assumed to be conservative. The usual equation is:

$$PARX = PAR_{sample} - \left[ \left( \frac{PAR}{Cl} \right)_{ref} * Cl_{sample} \right]$$

where $PAR_{sample}$ and $Cl_{sample}$ are the concentrations measured in the lake or river and $\left( \frac{PAR}{Cl} \right)_{ref}$ is (ideally) the long-term average concentration in incoming rainwater. In some cases the reference values are simply taken from sea water concentrations (ignoring effects such as evaporative fractionation etc.).

I'm not sure what values to assume, but by rearranging the above equation and applying it to data extracted from RESA2 I can back-calculate the reference values. For example, brief testing using data from Italy, Switzerland and the Czech Republic implies that RESA2 uses a standard reference value for sulphate of 0.103.
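For reference, the rearrangement used for this back-calculation is just:

$$\left( \frac{PAR}{Cl} \right)_{ref} = \frac{PAR_{sample} - PARX}{Cl_{sample}}$$

which is straightforward to apply to values pulled from the database, e.g. (a rough sketch, assuming a hypothetical dataframe df with ESO4, ESO4X and ECl columns, all in µeq/l):

# Rough sketch: back-calculate the sulphate reference ratio implied by
# RESA2, given a hypothetical dataframe 'df' with columns ESO4, ESO4X
# and ECl (all in ueq/l) extracted from the database
ref_so4 = ((df['ESO4'] - df['ESO4X']) / df['ECl']).median()
print(ref_so4)    # comes out at ~0.103 for the sites tested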

The reference ratios inferred from RESA2 for the key species listed are given in the table above.

NB: In doing this I've identified some additional errors in the database, where this correction has not been performed correctly. For some reason, ESO4X values have been set to zero, despite valid ESO4 and ECl measurements being available. The problem only affects a handful of samples, but could be enough to generate false trends. Return to this later?

NB2: Leah's experience with the RECOVER project suggests that assuming a single reference concentration for all countries in the world is a bad idea. For example, I believe that in e.g. the Czech Republic and Italy it is usual not to calculate sea-salt corrected concentrations at all, because most of the chloride input comes from industry rather than marine sources. Rainwater concentrations are also likely to vary dramatically from place to place, especially given the range of geographic and climatic conditions covered by this project. Check with Heleen.

4.1.3. ANC

I still need to work out how to calculate ANC, ALK, HPLUS and ENO3DIVENO3ESO4X.
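For ANC, the standard charge-balance definition is the sum of base cations minus the sum of strong acid anions (all in µeq/l). I assume this is what Tore's code uses, but it needs confirming with Heleen:

$$ANC = (ECa + EMg + ENa + EK + ENH4) - (ECl + ESO4 + ENO3)$$

Note that Na, K and NH4 aren't in the parameter list extracted below, so they'd need adding before ANC could be calculated this way. ALK presumably comes straight from the measured alkalinity, and HPLUSS from pH (i.e. $H^+ \; (\mu eq/l) = 10^{6 - pH}$), but again this needs checking.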

4.2. Choose projects

The first step is to specify a list of RESA2 projects and get the stations associated with them.


In [9]:
# Get stations for a specified list of projects
proj_list = ['ICPW_TOCTRENDS_2015_CZ', 'ICPW_TOCTRENDS_2015_IT']

sql = ('SELECT station_id, station_code '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT UNIQUE(station_id) '
                            'FROM resa2.projects_stations '
                            'WHERE project_id IN (SELECT project_id '
                            'FROM resa2.projects '
                            'WHERE project_name IN %s))'
                            % str(tuple(proj_list)))

stn_df = pd.read_sql(sql, engine)

stn_df


Out[9]:
station_id station_code
0 23499 CZ01
1 23500 CZ02
2 23501 CZ03
3 23502 CZ04
4 23503 CZ05
5 23504 CZ06
6 33325 CZ07
7 37745 CZ08
8 33326 CZ09
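One thing to note about the str(tuple(...)) trick used in the query above: with a single project it produces a trailing comma (e.g. ('ICPW_TOCTRENDS_2015_CZ',)), which Oracle rejects. The chemistry queries below handle the one-element case with an explicit if branch; a small helper along these lines (just a sketch, not used yet) would tidy that up:

def sql_in_clause(items):
    """ Build an Oracle-friendly IN (...) clause from a list of strings,
        avoiding the trailing comma that str(tuple(...)) leaves for a
        single-element list.
    """
    if len(items) == 1:
        return "('%s')" % items[0]
    else:
        return str(tuple(items))

print(sql_in_clause(proj_list))                      # two projects
print(sql_in_clause(['ICPW_TOCTRENDS_2015_CZ']))     # single project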

4.3. Extract time series

The next step is to get time series for the desired parameters for each of these stations.


In [10]:
# Specify parameters of interest
par_list = ['SO4', 'Cl', 'Ca', 'Mg', 'NO3-N', 'TOC', 'Al']

if 'DOC' in par_list:
    print ('The database treats DOC and TOC similarly.\n'
           'You should probably enter "TOC" instead')
    
# Check pars are valid
if len(par_list)==1:  
    sql = ("SELECT * FROM resa2.parameter_definitions "
           "WHERE name = '%s'" % par_list[0])
else:
    sql = ('SELECT * FROM resa2.parameter_definitions '
           'WHERE name in %s' % str(tuple(par_list)))
    
par_df = pd.read_sql_query(sql, engine)

assert len(par_df) == len(par_list), 'One or more parameters not valid.'

In [11]:
# Get results for ALL pars for sites and period of interest
if len(stn_df)==1:
    sql = ("SELECT * FROM resa2.water_chemistry_values2 "
           "WHERE sample_id IN (SELECT water_sample_id FROM resa2.water_samples "
                               "WHERE station_id = %s)"
                               % stn_df['station_id'].iloc[0])    
else:
    sql = ("SELECT * FROM resa2.water_chemistry_values2 "
           "WHERE sample_id IN (SELECT water_sample_id FROM resa2.water_samples "
                               "WHERE station_id IN %s)"
                               % str(tuple(stn_df['station_id'].values)))
    
wc_df = pd.read_sql_query(sql, engine)

# Get all sample dates for sites and period of interest
if len(stn_df)==1:
    sql = ("SELECT water_sample_id, station_id, sample_date "
           "FROM resa2.water_samples "
           "WHERE station_id = %s " % stn_df['station_id'].iloc[0])    
else:
    sql = ("SELECT water_sample_id, station_id, sample_date "
           "FROM resa2.water_samples "
           "WHERE station_id IN %s " % str(tuple(stn_df['station_id'].values)))
    
samp_df = pd.read_sql_query(sql, engine)

# Join in par IDs based on method IDs
sql = ('SELECT * FROM resa2.wc_parameters_methods')
meth_par_df = pd.read_sql_query(sql, engine)

wc_df = pd.merge(wc_df, meth_par_df, how='left',
                 left_on='method_id', right_on='wc_method_id')

# Get just the parameters of interest
wc_df = wc_df.query('wc_parameter_id in %s' % str(tuple(par_df['parameter_id'].values)))

# Join in sample dates
wc_df = pd.merge(wc_df, samp_df, how='left',
                 left_on='sample_id', right_on='water_sample_id')

# Join in parameter units
sql = ('SELECT * FROM resa2.parameter_definitions')
all_par_df = pd.read_sql_query(sql, engine)

wc_df = pd.merge(wc_df, all_par_df, how='left',
                 left_on='wc_parameter_id', right_on='parameter_id')

# Join in station codes
wc_df = pd.merge(wc_df, stn_df, how='left',
                 left_on='station_id', right_on='station_id')

# Convert units
wc_df['value'] = wc_df['value'] * wc_df['conversion_factor']

# Extract columns of interest
wc_df = wc_df[['station_id', 'sample_date', 'name', 'value']]

# Unstack
wc_df.set_index(['station_id', 'sample_date', 'name'], inplace=True)
wc_df = wc_df.unstack(level='name')
wc_df.columns = wc_df.columns.droplevel()
wc_df.reset_index(inplace=True)
wc_df.columns.name = None

wc_df.head()


Out[11]:
station_id sample_date Al Ca Cl Mg NO3-N SO4 TOC
0 23499 1984-06-04 929.0 1.08 0.90 0.65 NaN 6.81 3.9
1 23499 1984-08-04 840.0 1.05 0.92 0.60 964.0 6.51 4.2
2 23499 1985-06-05 851.0 1.01 0.89 0.58 NaN 6.85 NaN
3 23499 1986-06-06 918.0 1.07 0.99 0.56 NaN 7.01 NaN
4 23499 1986-08-06 795.0 1.01 1.07 0.54 NaN 6.77 NaN

4.4. Aggregate to annual


In [12]:
# Extract year from date column
wc_df['year'] = wc_df['sample_date'].map(lambda x: x.year)
del wc_df['sample_date']

# Groupby station_id and year
grpd = wc_df.groupby(['station_id', 'year'])

# Calculate median
wc_df = grpd.agg('median')

wc_df.head()


Out[12]:
Al Ca Cl Mg NO3-N SO4 TOC
station_id year
23499 1984 884.5 1.065 0.91 0.625 964.0 6.660 4.05
1985 851.0 1.010 0.89 0.580 NaN 6.850 NaN
1986 856.5 1.040 1.03 0.550 NaN 6.890 NaN
1987 750.0 1.100 1.03 0.540 NaN 6.620 NaN
1988 795.0 1.010 0.79 0.555 NaN 6.075 NaN

4.5. Convert units and apply sea-salt correction

I haven't calculated all 14 parameters here, as I'm not sure exactly what they all are. The ones I'm reasonably certain of are included below.


In [13]:
# 1. Convert to ueq/l
for par in ['SO4', 'Cl', 'Mg', 'Ca', 'NO3-N']:
    val = chem_df.ix[par, 'valency']
    mm = chem_df.ix[par, 'molar_mass']
    
    if par == 'NO3-N':
        wc_df['ENO3'] = wc_df[par] * val / mm
    else:
        wc_df['E%s' % par] = wc_df[par] * val * 1000. / mm

# 2. Apply sea-salt correction
for par in ['ESO4', 'EMg', 'ECa']:
    ref = chem_df.ix[par[1:], 'resa2_ref_ratio']
    wc_df['%sX' % par] = wc_df[par] - (ref*wc_df['ECl'])
    
# 3. Calculate combinations
# 3.1. ESO4 + ECl
wc_df['ESO4_ECl'] = wc_df['ESO4'] + wc_df['ECl']

# 3.2. ECa + EMg
wc_df['ECa_EMg'] = wc_df['ECa'] + wc_df['EMg']

# 3.3. ECaX + EMgX
wc_df['ECaX_EMgX'] = wc_df['ECaX'] + wc_df['EMgX']

# 3.4. ESO4 + ECl + ENO3
wc_df['ESO4_ECl_ENO3'] = wc_df['ESO4'] + wc_df['ECl'] + wc_df['ENO3']

# 4. Delete unnecessary columns and tidy
for col in ['SO4', 'Cl', 'Mg', 'Ca', 'NO3-N']:
    del wc_df[col]

wc_df.reset_index(inplace=True)
    
wc_df.head()


Out[13]:
station_id year Al TOC ESO4 ECl EMg ECa ENO3 ESO4X EMgX ECaX ESO4_ECl ECa_EMg ECaX_EMgX ESO4_ECl_ENO3
0 23499 1984 884.5 4.05 138.750000 26.000000 52.083333 53.25 68.857143 136.072000 46.987333 52.288000 164.750000 105.333333 99.275333 233.607143
1 23499 1985 851.0 NaN 142.708333 25.428571 48.333333 50.50 NaN 140.089190 43.349333 49.559143 168.136905 98.833333 92.908476 NaN
2 23499 1986 856.5 NaN 143.541667 29.428571 45.833333 52.00 NaN 140.510524 40.065333 50.911143 172.970238 97.833333 90.976476 NaN
3 23499 1987 750.0 NaN 137.916667 29.428571 45.000000 55.00 NaN 134.885524 39.232000 53.911143 167.345238 100.000000 93.143143 NaN
4 23499 1988 795.0 NaN 126.562500 22.571429 46.250000 50.50 NaN 124.237643 41.826000 49.664857 149.133929 96.750000 91.490857 NaN

In [14]:
def process_water_chem_df(stn_df, wc_df, st_yr=None, end_yr=None):
    """ Calculate statistics for the stations, parameters and time
        periods specified.
    
    Args:
        stn_df:   Dataframe of station_ids
        wc_df:    Dataframe of water chemistry time series for stations
                  and parameters of interest
        st_yr:    First year to include in analysis. Pass None to start
                  at the beginning of the series
        end_yr:   Last year to include in analysis. Pass None to continue
                  to the end of the series
    
    Returns:
        Dataframe of statistics
    """
    # Container for output
    df_list = []

    # Loop over sites
    for stn_id in stn_df['station_id']:
        # Extract data for this site
        df = wc_df.query('station_id == @stn_id')

        # Modify col names
        names = list(df.columns)
        names[:2] = ['STATION_ID', 'YEAR']
        df.columns = names

        # Run analysis
        df_list.append(wc_stats(df, st_yr=st_yr, end_yr=end_yr))

    res_df = pd.concat(df_list, axis=0)
    
    return res_df
    
res_df = process_water_chem_df(stn_df, wc_df)

res_df.head()


Out[14]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val mk_std_dev trend sen_slp
0 23499 Al 1984-2014 31 493.670533 487.500000 217.901691 -341.0 -5.778782 7.524345e-09 58.835930 decreasing -21.000000
1 23499 TOC 1984-2014 22 1.953409 1.775000 0.744290 66.0 1.833595 6.671420e-02 35.449495 no trend 0.043750
2 23499 ESO4 1984-2014 31 90.795274 88.020833 32.177584 -429.0 -7.274466 3.477219e-13 58.835930 decreasing -3.523551
3 23499 ECl 1984-2014 31 20.778903 21.142857 4.846152 -311.0 -5.270412 1.361179e-07 58.818931 decreasing -0.441558
4 23499 EMg 1984-2014 31 39.522452 38.750000 5.930091 -355.0 -6.022534 1.717076e-09 58.779248 decreasing -0.583333

This seems to be working OK so far, but I need to do some more testing to see that my results more-or-less agree with those calculated previously by Tore. As a start, let's compare the results above with those in the ICPW_STATISTICS3 table of RESA2, which is where (I think) Tore has saved his previous output.


In [15]:
# Get results for test sites from RESA2
sql = ('SELECT * FROM resa2.icpw_statistics3 '
       'WHERE station_id IN %s' 
       % str(tuple(stn_df['station_id'].values)))

stat_df = pd.read_sql(sql, engine)

# Get just the cols to compare to my output
stat_df = stat_df[['station_id', 'parameter', 'period', 'nonmiss',
                   'average', 'median', 'stdev', 'test_stat', 
                   'mk_stat', 'mkp', 'senslope']]

stat_df.head(14).sort_values(by='parameter')


Out[15]:
station_id parameter period nonmiss average median stdev test_stat mk_stat mkp senslope
8 23499 AL 1990-2004 15 486.052433 503.500000 186.876929 -83 -4.107435 0.000040 -32.700000
10 23499 ALK 1990-2004 14 -15.896429 -15.850000 6.270918 20 1.096542 0.272842 0.500000
9 23499 ANC 1990-2004 12 -47.948231 -53.510865 14.813228 48 3.291482 0.000997 3.178531
5 23499 ECAEMG 1990-2004 15 84.136540 83.398845 9.581870 -75 -3.711537 0.000206 -1.968997
6 23499 ECAXEMGX 1990-2004 15 79.221717 78.338350 9.248877 -71 -3.513589 0.000442 -1.886138
2 23499 ECL 1990-2004 15 21.093655 21.718895 2.980660 -44 -2.180106 0.029250 -0.435917
7 23499 ENO3 1990-2004 12 65.111360 66.036608 7.299949 -40 -2.742902 0.006090 -1.234397
13 23499 ENO3DIVENO3ESO4X 1990-2004 12 0.428599 0.427500 0.028221 32 2.194322 0.028212 0.006281
0 23499 ESO4 1990-2004 15 95.643491 97.541580 18.305102 -97 -4.800255 0.000002 -4.089638
3 23499 ESO4CL 1990-2004 15 116.737145 118.934785 20.274570 -95 -4.701281 0.000003 -4.449103
12 23499 ESO4ECLENO3 1990-2004 12 175.508512 176.306256 23.280313 -52 -3.565772 0.000363 -5.589727
1 23499 ESO4X 1990-2004 15 93.470844 95.253690 18.117947 -97 -4.800255 0.000002 -4.033607
11 23499 HPLUSS 1990-2004 15 18.383278 18.506625 4.313829 -69 -3.414614 0.000639 -0.795074
4 23499 TOC_DOC 1990-2004 11 1.561364 1.385000 0.554342 4 0.312348 0.754776 0.009286

For example, for site 23499 I can now re-run my code for the period from 1990 to 2004 and compare my results to those above.


In [16]:
# Re-run python analysis for the period 1990 - 2004
res_df = process_water_chem_df(stn_df, wc_df, st_yr=1990, end_yr=2004)

# Delete mk_std_dev as not relevant here
del res_df['mk_std_dev']

res_df.head(14).sort_values(by='par_id')


Out[16]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
0 23499 Al 1990-2004 15 485.085767 503.500000 187.586374 -83.0 -4.057948 0.000050 decreasing -33.200000
5 23499 ECa 1990-2004 15 44.283333 44.000000 5.584470 -75.0 -3.662050 0.000250 decreasing -1.166667
9 23499 ECaX 1990-2004 15 43.489771 43.186000 5.524900 -75.0 -3.662050 0.000250 decreasing -1.179175
12 23499 ECaX_EMgX 1990-2004 15 79.786038 78.874000 9.366281 -71.0 -3.464102 0.000532 decreasing -1.899667
11 23499 ECa_EMg 1990-2004 15 84.783333 84.000000 9.708540 -73.0 -3.563076 0.000367 decreasing -1.981481
3 23499 ECl 1990-2004 15 21.447619 22.000000 3.017861 -44.0 -2.130559 0.033126 decreasing -0.441558
4 23499 EMg 1990-2004 15 40.500000 40.000000 4.170237 -75.0 -3.671052 0.000242 decreasing -0.833333
8 23499 EMgX 1990-2004 15 36.296267 35.930000 3.924888 -67.0 -3.266153 0.001090 decreasing -0.739152
6 23499 ENO3 1990-2004 12 64.866071 66.037500 7.380028 -40.0 -2.674329 0.007488 decreasing -1.278393
2 23499 ESO4 1990-2004 15 95.625000 98.125000 18.527533 -95.0 -4.651794 0.000003 decreasing -4.083333
7 23499 ESO4X 1990-2004 15 93.415895 95.947286 18.336954 -95.0 -4.651794 0.000003 decreasing -3.965619
10 23499 ESO4_ECl 1990-2004 15 117.072619 119.267857 20.531061 -95.0 -4.651794 0.000003 decreasing -4.457792
13 23499 ESO4_ECl_ENO3 1990-2004 12 175.649554 176.662202 23.824901 -50.0 -3.360055 0.000779 decreasing -5.648674
1 23499 TOC 1990-2004 11 1.561364 1.385000 0.554342 4.0 0.234261 0.814783 no trend 0.009286

The numbers in the above two tables are almost identical, which is actually pretty remarkable given that I'm second-guessing a lot of the decisions in Tore's analysis (having never seen his code) and also recoding everything from scratch. It certainly looks as though this will be a viable alternative for re-running the trends analysis.

There are still some loose ends to tie up. In particular, I need to add a few more parameters to the trends calculation, but that shouldn't be difficult once I've spoken to Heleen to find out what they are and how to calculate them. In the meantime, I'm sufficiently happy with this output to move the code into a separate module and then continue to explore the data in a new notebook.

NB: A nice way to visualise these results would be to create a google map, where each point is coloured according to 'increasing', 'decreasing' or 'no trend'. A pop-up on each point could then give the main summary statistics and a time series plot with the Sen's slope line overlaid. This would result in lots of points on top of each other, but users could filter the map to just show one parameter at a time to avoid things becoming too cluttered.
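As a very rough sketch of how that might look (using folium rather than the Google Maps API, and assuming a hypothetical dataframe map_df with station_code, lat, lon and trend columns for a single parameter, which would need building from res_df plus co-ordinates from resa2.stations):

# Hypothetical sketch of an interactive trend map using folium.
# 'map_df' (station_code, lat, lon, trend) is assumed, not built above.
import folium

colours = {'increasing': 'red', 'decreasing': 'blue', 'no trend': 'grey'}

m = folium.Map(location=[55, 10], zoom_start=4)

for idx, row in map_df.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=5,
        color=colours.get(row['trend'], 'black'),
        fill=True,
        fill_color=colours.get(row['trend'], 'black'),
        popup='%s: %s' % (row['station_code'], row['trend'])).add_to(m)

m.save('toc_trends_map.html')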