In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os
from sqlalchemy import create_engine
sn.set_context('notebook')

Updated TOC trends analysis (part 2)

This notebook begins to explore trends in the broader water chemistry dataset collated by Heleen, Don and John for the updated TOC analysis. Note that the results presented here should be considered preliminary, as there are still some outstanding database "cleaning" tasks that I haven't found time for yet. I'm jumping ahead here because I suspect the best way to identify any further data issues is to attempt the trend analysis and see what problems crop up.

My original plan - as agreed with Heleen, Don and John in May - was to tidy up the data as much as possible and then simply re-run Tore's code for the trends calculations. Unfortunately, although I've managed to find most of the necessary code (either within RESA2 or as VBA projects within linked Access databases), I have been unable to locate some of the crucial subroutines, including the ones for the bulk of the statistical analysis. In addition, as far as I can tell, the code relies on an old, third-party Excel macro for the statistical calculations, and using this involves a lot of data shuffling (first from RESA2 to Excel, then into Access and finally back into RESA2), which is quite slow and difficult to keep track of.

What I have found in RESA2 is a table called ICPW_STATISTICS3, which appears to store the output of Tore's analysis. I think my best option is therefore to recode the whole thing from scratch, and then compare my results with those in Tore's table to make sure what I'm doing is more-or-less compatible with what's been done before. This will involve digging into the internals of RESA2, which will hopefully be a good learning experience for me as well.

1. Trend analysis code

This section provides an overview of the code I've written for the updated trends analysis. The code was initially developed in an earlier IPython notebook and later moved into a .py file to allow the trends functionality to be imported into other notebooks. This Python file is here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Python\toc_trends_analysis.py

NB: The code in the Python file is slightly different from (and more sophisticated than) the code in the original notebook. Use the Python file (not the notebook) as the basis for any further developments.

To run an analysis, the user must provide a list of valid RESA2 project names, plus start and end years for the period of interest. Optionally, it is possible to specify a folder where trend plots for each parameter will be stored. The code then performs the following operations:

  1. Identifies all the monitoring stations associated with the specified projects.

  2. Generates a list of water samples and sampling dates for all these stations within the period of interest. If some stations have no samples within the specified period, a warning message is printed and the list of stations with no data is included in the output.

    Note that RESA2 sometimes includes measurements for samples collected at several different water depths. The code currently only selects near-surface water samples (depth $\leq 1 \; m$). Check this with Heleen.

  3. Extracts the data for the key trends parameters (currently 'SO4', 'Cl', 'Ca', 'Mg', 'NO3-N', 'TOC' and 'Al', but this needs amending - see below) for the selected water samples and converts the raw values from the units and methods originally specified into RESA2's "standard" units (as specified in the RESA2.PARAMETER_DEFINITIONS table).

  4. Checks for duplicate measurements of the same parameter at the same sampling point and date. If duplicates are found, the values are averaged and a warning message is printed. A list of duplicated records is then included in the output to facilitate data checking.

  5. Aggregates the sampled values from their native collection interval to annual frequency by taking medians. (The core calculations for this and the following steps are sketched in the code block after this list.)

  6. For the parameters 'SO4', 'Cl', 'Mg', 'Ca' and 'NO3-N', concentrations are recalculated as $\mu eq/l$ (denoted by the prefix 'E' e.g. ESO4).

    $$EPAR \; (\mu eq/l) = \frac{10^6 * valency}{molar \; mass \; (g/mol)} * PAR \; (g/l)$$

  7. Sea-salt corrections are then applied for the parameters 'ESO4', 'EMg' and 'ECa' (denoted by the suffix 'X', e.g. ESO4X).

    $$EPARX = EPAR_{sample} - \left[ \left( \frac{EPAR}{ECl} \right)_{ref} * ECl_{sample} \right]$$

    NB: I'm not sure what reference ratios were used in the original analysis. I've attempted to back-calculate them from RESA2, and it looks as though a single value has been assumed worldwide for each parameter? Check this with Heleen. Also, see section 4.1.2 of the code development notebook for more details.

  8. Calculates combined parameters, e.g. ESO4_ECl $= (ESO4 + ECl)$.

    At present, the code generates annual time series for the following parameters:

    • ESO4 ($\mu eq/l$)
    • ESO4X ($\mu eq/l$)
    • ECl ($\mu eq/l$)
    • ESO4_ECl ($\mu eq/l$)
    • ECa_EMg ($\mu eq/l$)
    • ECaX_EMgX ($\mu eq/l$)
    • ENO3 ($\mu eq/l$)
    • ESO4_ECl_ENO3 ($\mu eq/l$)
    • TOC ($mg C/l$)
    • Al ($mg/l$)

    NB: It looks as though Tore's code calculates a few additional parameters as well:

    • ANC
    • ALK
    • HPLUS
    • ENO3_DIV_ENO3_ESO4X

    These will be easy to add, but I'd like to check exactly which parameters are of interest and how they should be calculated before including them.

  9. Performs statistical analysis for each parameter at each site for the period specified. The output includes the following summary statistics:

    • Period over which data are available (i.e. start and end years, which are not always the same as the period originally specified)
    • Number of non-missing values
    • Median of data values
    • Mean of data values
    • Standard deviation of data values
    • Standard deviation expected under the null hypothesis of the Mann-Kendall (M-K) test
    • M-K statistic
    • Normalised M-K statistic $\left(= \frac{M-K \; statistic}{Expected \; standard \; deviation} \right)$
    • M-K p-value
    • Sen's slope (a.k.a. the Theil-Sen slope). Optionally, a plot of the trend line can be produced.

    Note that the algorithm uses the normal approximation to estimate the variance of the S parameter in the M-K test (and thereby the significance of any trends). This approximation is only robust when the number of non-null data points in the time series is $\geq 10$. If the number of non-missing values is fewer than 10, the code prints a warning message to indicate significance estimates may be unreliable.

  10. The output from the algorithm consists of (i) a dataframe of summary statistics for each parameter at each site over the period of interest; (ii) a list of stations with no data in the period specified and (iii) a list of duplicated values (where the database contains two measurements of the same parameter at the same location on the same day). If the plot option is set to True when the function is called, the code will also output (iv) plots of the Theil-Sen regression line for each parameter, saved to the specified output folder.
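
To make the steps above concrete, the sketch below shows the core calculations for steps 5 to 7 and 9 in stripped-down form. It is an illustration only, not the production code: toc_trends_analysis.py also handles missing values, duplicate averaging, combined parameters and plotting, and all the names here (to_ueq_per_l, mk_test, wc_df etc.) are invented for the example rather than taken from the module. The sea-salt reference ratio of 0.103 (the approximate equivalent ratio of SO4 to Cl in sea water) is also just an assumption for illustration.

import numpy as np
import pandas as pd
from scipy.stats import norm

def to_ueq_per_l(par_gpl, valency, molar_mass):
    """Step 6: convert a concentration in g/l to micro-equivalents per litre."""
    return 1e6 * valency * par_gpl / molar_mass

def sea_salt_correct(epar, ecl, ref_ratio):
    """Step 7: subtract the marine fraction, estimated from chloride and an
    assumed reference EPAR/ECl equivalent ratio."""
    return epar - ref_ratio * ecl

def mk_test(vals):
    """Step 9: Mann-Kendall test using the normal approximation (no tie
    correction). Returns (S, expected std. dev., normalised statistic,
    two-sided p-value). Only robust for >= 10 non-null values."""
    vals = np.asarray(vals, dtype=float)
    n = len(vals)

    # S is the sum of signs over all pairs (i < j)
    s = np.sum([np.sign(vals[j] - vals[i])
                for i in range(n - 1)
                for j in range(i + 1, n)])

    # Standard deviation of S under the null hypothesis of no trend
    sd = np.sqrt(n * (n - 1.) * (2. * n + 5.) / 18.)

    # Continuity-corrected normalised statistic
    if s > 0:
        z = (s - 1.) / sd
    elif s < 0:
        z = (s + 1.) / sd
    else:
        z = 0.

    p = 2. * (1. - norm.cdf(abs(z)))

    return s, sd, z, p

def sen_slope(years, vals):
    """Step 9: Sen's slope, i.e. the median of all pairwise slopes."""
    slopes = [(vals[j] - vals[i]) / float(years[j] - years[i])
              for i in range(len(vals) - 1)
              for j in range(i + 1, len(vals))]
    return np.median(slopes)

# Invented monthly data for one station: SO4 and Cl in mg/l over 15 years
wc_df = pd.DataFrame({'sample_date': pd.date_range('1990-01-01', periods=180,
                                                   freq='M'),
                      'SO4': np.linspace(5., 2., 180),  # declining trend
                      'Cl': np.full(180, 4.)})

# Step 5: aggregate to annual frequency by taking medians
wc_df['year'] = wc_df['sample_date'].dt.year
ann_df = wc_df.groupby('year')[['SO4', 'Cl']].median()

# Step 6: mg/l -> g/l -> ueq/l (SO4: valency 2, molar mass 96.06 g/mol;
# Cl: valency 1, molar mass 35.45 g/mol)
ann_df['ESO4'] = to_ueq_per_l(ann_df['SO4'] / 1000., 2, 96.06)
ann_df['ECl'] = to_ueq_per_l(ann_df['Cl'] / 1000., 1, 35.45)

# Step 7: sea-salt correction with the assumed reference ratio
ann_df['ESO4X'] = sea_salt_correct(ann_df['ESO4'], ann_df['ECl'], 0.103)

# Step 9: trend statistics for the corrected series
s, sd, z, p = mk_test(ann_df['ESO4X'].values)
slp = sen_slope(ann_df.index.values, ann_df['ESO4X'].values)

Note the continuity correction in mk_test (dividing $S - 1$ or $S + 1$, rather than $S$, by the expected standard deviation). This appears to be the main reason my normalised statistics differ slightly from Tore's in the comparison below: for Al at site 23499, $-83 / 20.21 = -4.107$ (Tore's value), whereas $-82 / 20.21 = -4.058$ (mine).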

2. Illustrative example and testing

This section illustrates how the code can be used to estimate trends. As a basic test, I'll start by reading some results from Tore's ICPW_STATISTICS3 table, which I believe stores the output from his statistical analyses. I'll then run my code for the same stations and time periods to make sure the results are comparable.

NB: My earlier notebook focussing on the code development includes some more rigorous testing.

2.1. Import modules and connect to database


In [2]:
# Import custom functions
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')

resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

# Connect to db
engine, conn = resa2_basic.connect_to_resa2()

# Import code for trends analysis
resa2_trends_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
                     r'\Python\toc_trends_analysis.py')

resa2_trends = imp.load_source('toc_trends_analysis', resa2_trends_path)
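
For reference, the connection made by connect_to_resa2() boils down to creating a SQLAlchemy engine for the Oracle instance underlying RESA2, roughly as sketched below. The user name, password, host and SID are placeholders here - the real connection details live in useful_resa2_code.py.

from sqlalchemy import create_engine

# Placeholder credentials and DSN for the Oracle instance behind RESA2
conn_str = 'oracle+cx_oracle://my_user:my_password@my_host:1521/my_sid'
engine = create_engine(conn_str)
conn = engine.connect()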

2.2. Read previous results from RESA2

As an example, we'll extract some of Tore's results for one of the Czech sites (station_id = 23499) for the period from 1990 to 2004.


In [3]:
# Get results for test sites from RESA2
sql = ("SELECT * FROM resa2.icpw_statistics3 "
       "WHERE station_id = 23499 "
       "AND period = '1990-2004'")

old_df = pd.read_sql(sql, engine)

# Get just the cols to compare to my output
old_df = old_df[['station_id', 'parameter', 'period', 'nonmiss',
                   'average', 'median', 'stdev', 'test_stat', 
                   'mk_stat', 'mkp', 'senslope']]

old_df.head(14).sort_values(by='parameter')


Out[3]:
station_id parameter period nonmiss average median stdev test_stat mk_stat mkp senslope
0 23499 AL 1990-2004 15 486.052433 503.500000 186.876929 -83 -4.107435 0.000040 -32.700000
1 23499 ALK 1990-2004 14 -15.896429 -15.850000 6.270918 20 1.096542 0.272842 0.500000
2 23499 ANC 1990-2004 12 -47.948231 -53.510865 14.813228 48 3.291482 0.000997 3.178531
3 23499 ECAEMG 1990-2004 15 84.136540 83.398845 9.581870 -75 -3.711537 0.000206 -1.968997
4 23499 ECAXEMGX 1990-2004 15 79.221717 78.338350 9.248877 -71 -3.513589 0.000442 -1.886138
5 23499 ECL 1990-2004 15 21.093655 21.718895 2.980660 -44 -2.180106 0.029250 -0.435917
6 23499 ENO3 1990-2004 12 65.111360 66.036608 7.299949 -40 -2.742902 0.006090 -1.234397
7 23499 ENO3DIVENO3ESO4X 1990-2004 12 0.428599 0.427500 0.028221 32 2.194322 0.028212 0.006281
8 23499 ESO4 1990-2004 15 95.643491 97.541580 18.305102 -97 -4.800255 0.000002 -4.089638
9 23499 ESO4CL 1990-2004 15 116.737145 118.934785 20.274570 -95 -4.701281 0.000003 -4.449103
10 23499 ESO4ECLENO3 1990-2004 12 175.508512 176.306256 23.280313 -52 -3.565772 0.000363 -5.589727
11 23499 ESO4X 1990-2004 15 93.470844 95.253690 18.117947 -97 -4.800255 0.000002 -4.033607
12 23499 HPLUSS 1990-2004 15 18.383278 18.506625 4.313829 -69 -3.414614 0.000639 -0.795074
13 23499 TOC_DOC 1990-2004 11 1.561364 1.385000 0.554342 4 0.312348 0.754776 0.009286

2.3. Run new code

The above table shows output from the previous analysis for the period from 1990 to 2004. The code below runs my new trend analysis for all of the Czech sites (project_name = 'ICPW_TOCTRENDS_2015_CZ') for the same period and then prints just the results for site 23499 for comparison.


In [4]:
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CZ',]

# Run analysis for the period 1990 - 2004
new_df, dup_df, no_data_df = resa2_trends.run_trend_analysis(proj_list, engine,
                                                             st_yr=1990, end_yr=2004)

# Delete mk_std_dev col as not relevant here
del new_df['mk_std_dev']

new_df.head(14).sort_values(by='par_id')


Extracting data from RESA2...
    Done.

Converting units and applying sea-salt correction...
    Done.

Calculating statistics...
    Done.

Finished.
Out[4]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
0 23499 Al 1990-2004 15 485.085767 503.500000 187.586374 -83.0 -4.057948 0.000050 decreasing -33.200000
5 23499 ECa 1990-2004 15 44.283333 44.000000 5.584470 -75.0 -3.662050 0.000250 decreasing -1.166667
9 23499 ECaX 1990-2004 15 43.489771 43.186000 5.524900 -75.0 -3.662050 0.000250 decreasing -1.179175
12 23499 ECaX_EMgX 1990-2004 15 79.786038 78.874000 9.366281 -71.0 -3.464102 0.000532 decreasing -1.899667
11 23499 ECa_EMg 1990-2004 15 84.783333 84.000000 9.708540 -73.0 -3.563076 0.000367 decreasing -1.981481
3 23499 ECl 1990-2004 15 21.447619 22.000000 3.017861 -44.0 -2.130559 0.033126 decreasing -0.441558
4 23499 EMg 1990-2004 15 40.500000 40.000000 4.170237 -75.0 -3.671052 0.000242 decreasing -0.833333
8 23499 EMgX 1990-2004 15 36.296267 35.930000 3.924888 -67.0 -3.266153 0.001090 decreasing -0.739152
6 23499 ENO3 1990-2004 12 64.866071 66.037500 7.380028 -40.0 -2.674329 0.007488 decreasing -1.278393
2 23499 ESO4 1990-2004 15 95.625000 98.125000 18.527533 -95.0 -4.651794 0.000003 decreasing -4.083333
7 23499 ESO4X 1990-2004 15 93.415895 95.947286 18.336954 -95.0 -4.651794 0.000003 decreasing -3.965619
10 23499 ESO4_ECl 1990-2004 15 117.072619 119.267857 20.531061 -95.0 -4.651794 0.000003 decreasing -4.457792
13 23499 ESO4_ECl_ENO3 1990-2004 12 175.649554 176.662202 23.824901 -50.0 -3.360055 0.000779 decreasing -5.648674
1 23499 TOC 1990-2004 11 1.561364 1.385000 0.554342 4.0 0.234261 0.814783 no trend 0.009286

The results from my code are not exactly the same as the output produced by Tore, but for all practical purposes I'd say the differences are negligible. This is actually pretty surprising given the amount of second-guessing and reverse-engineering that went into developing my code.

The next big question is whether my code will upscale effectively to the full TOC dataset. The RESA2 application usually crashes if a user tries to extract data for more than one large project at a time, but I'm hoping that by bypassing RESA2 and communicating directly with the underlying Oracle instance, my code will not be affected by this issue.

3. Run the trend analysis on the full dataset

This section attempts to run the trends analysis on the full TOC dataset. Initially I'll use the data for all years and I'll also generate plots for each series (which I'll return to later). The projects to be included have been previously agreed with Heleen - see section 3 of this notebook for details.


In [5]:
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

# Folder for saving PNG plots
plot_fold=r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Trends_Plots'

# Run analysis
res_df, dup_df, no_data_df = resa2_trends.run_trend_analysis(proj_list, engine,
                                                             st_yr=None, end_yr=None,
                                                             plot=True, fold=plot_fold)

# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']


Extracting data from RESA2...
    The database contains duplicate values for some station-date-parameter combinations.
    These will be averaged, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    Some stations have no relevant data in the period specified. Their IDs are returned in a separate dataframe.

    Done.

Converting units and applying sea-salt correction...
    Done.

Calculating statistics...
    Data series for Al at site 101 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 102 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 103 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 104 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 107 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 109 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 112 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 115 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 118 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 119 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 120 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 121 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 122 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 123 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 128 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 132 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 134 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 135 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 144 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 146 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 147 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 150 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 156 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 158 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 161 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 162 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 163 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 166 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 168 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 170 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 173 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 176 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 179 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 180 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 181 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 182 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 183 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 185 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 192 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 193 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 196 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 12081 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 23546 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 36547 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 36560 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Done.

Finished.

3.1. Results

It's good to see the algorithm manages to process all the sites in one go. The table below shows the first 10 rows of the output, but it's the warning messages that I'm most interested in at the moment.


In [6]:
res_df.head(10)


Out[6]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
0 100.0 Al 1986-2015 1 276.433655 276.433655 NaN NaN NaN NaN NaN NaN
1 100.0 TOC 1986-2015 28 6.053929 6.000000 1.277101 221.0 4.351249 1.353642e-05 increasing 0.100000
2 100.0 ESO4 1986-2015 28 37.526786 31.250000 18.794670 -327.0 -6.446919 1.141465e-10 decreasing -2.001984
3 100.0 ECl 1986-2015 28 18.061224 17.142857 3.572260 -120.0 -2.369597 1.780750e-02 decreasing -0.184694
4 100.0 EMg 1986-2015 28 10.711310 10.833333 2.028790 -134.0 -2.652231 7.996170e-03 decreasing -0.133025
5 100.0 ECa 1986-2015 28 18.607143 18.750000 3.090029 -126.0 -2.474722 1.333399e-02 decreasing -0.187500
6 100.0 ENO3 1986-2015 28 2.418367 2.000000 1.526747 41.0 0.791033 4.289247e-01 no trend 0.028571
7 100.0 ESO4X 1986-2015 28 35.666480 29.778571 18.611860 -329.0 -6.481403 9.087331e-11 decreasing -1.945556
8 100.0 EMgX 1986-2015 28 7.171310 6.860667 1.698371 -123.0 -2.410766 1.591906e-02 decreasing -0.102902
9 100.0 ECaX 1986-2015 28 17.938878 18.068143 3.040012 -122.0 -2.390539 1.682367e-02 decreasing -0.182602

3.2. Data issues

3.2.1. Limited Al data

Many of the sites have limited Al data (see the warnings above). This isn't a database error, but it could have implications for our ability to detect trends in this parameter.
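
To quantify the problem, something like the snippet below would list the affected sites from the res_df output above (a sketch; short_al is just an illustrative name):

# Sites where the Al series has fewer than 10 non-null values
short_al = res_df.query("(par_id == 'Al') and (non_missing < 10)")

print 'Number of sites with short Al series:', len(short_al)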

3.2.2. Sites with no data


In [7]:
# Get properties for sites with no data
# Basic station properties
sql = ('SELECT station_id, station_code, station_name '
       'FROM resa2.stations '
       'WHERE station_id in %s' 
       % str(tuple(no_data_df['station_id'].values)))

na_stns = pd.read_sql(sql, engine)

# Get country for each station
sql = ('SELECT station_id, value '
       'FROM resa2.stations_par_values '
       'WHERE station_id in %s '
       'AND var_id = 261'
       % str(tuple(no_data_df['station_id'].values)))

co_df = pd.read_sql(sql, engine)

# Decode special characters from `windows-1252` encoding to unicode
na_stns['station_name'] = na_stns['station_name'].str.decode('windows-1252')

# Join
na_stns = pd.merge(na_stns, co_df, how='left',
                   on='station_id')

na_stns.columns = ['station_id', 'station_code', 'station_name', 'country']

print 'Number of sites with no data:', len(no_data_df)
print 'Sites with no data come from the following countries:'
print list(na_stns.country.unique())

na_stns.head()


Number of sites with no data: 178
Sites with no data come from the following countries:
['Sweden']
Out[7]:
station_id station_code station_name country
0 36573 656612-164132 Årsjön Sweden
1 36574 655275-153234 Älgsjön Sweden
2 36577 656984-164254 Albysjön Sweden
3 36579 665768-164748 Aspdalssjön Sweden
4 36580 656832-161545 Aspen Sweden

So there are 178 Swedish sites (out of 261 in total) that have no data whatsoever for the parameters $SO_4$, $Cl$, $Ca$, $Mg$, $NO_3$, $TOC$ or $Al$. It seems strange to have these included in a project focussing on TOC trends! Perhaps they're included due to having data for other useful parameters? Let's check.


In [8]:
# Find ANY parameters associated with the 178 Swedish sites
sql = ('SELECT * '
       'FROM resa2.water_samples '
       'WHERE station_id in %s' 
       % str(tuple(no_data_df['station_id'].values)))

swed_df = pd.read_sql(sql, engine)

print 'Number of samples associated with the 178 Swedish sites (for ALL parameters):', len(swed_df)


Number of samples associated with the 178 Swedish sites (for ALL parameters): 0

So it appears that more than two-thirds of the Swedish sites in the database have no data at all. Is this expected, or has something gone wrong here?

3.2.3. Duplicated values

Some duplicate values are expected, because at some sites samples are taken at a variety of depths. My algorithm currently only selects samples from the upper 1 m of the water column, but this occasionally spans two depth measurements (usually 0 and 0.5 m). For the moment, it seems reasonable to average these values (and in most cases they are near-identical anyway), but check this with Heleen and modify the code if necessary.
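
For reference, the averaging itself is just a groupby-mean over station, date and parameter, as sketched below. The dataframe is a made-up example mirroring the structure of dup_df, not the actual code from toc_trends_analysis.py.

import pandas as pd

# Made-up example: one duplicated station-date-parameter combination
raw_df = pd.DataFrame({'station_id': [103, 103, 103],
                       'sample_date': ['1974-10-29'] * 3,
                       'name': ['Ca', 'Ca', 'Cl'],
                       'value': [1.20, 1.30, 4.40]})

# Duplicates are averaged; unique combinations pass through unchanged
avg_df = raw_df.groupby(['station_id', 'sample_date', 'name'],
                        as_index=False)['value'].mean()

Here the two Ca values are averaged to 1.25, while the single Cl value passes through unchanged.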


In [9]:
print 'Total number of duplicated records:', len(dup_df)

dup_df.head(10)


Total number of duplicated records: 2617
Out[9]:
station_id sample_date name value
1270 103.0 1974-10-29 Ca 1.25
57232 103.0 1974-10-29 Ca 1.25
1271 103.0 1974-10-29 Cl 4.40
57233 103.0 1974-10-29 Cl 4.40
1272 103.0 1974-10-29 Mg 0.74
57234 103.0 1974-10-29 Mg 0.74
1273 103.0 1974-10-29 NO3-N 90.00
57235 103.0 1974-10-29 NO3-N 90.00
6 104.0 1974-11-05 Ca 2.07
57242 104.0 1974-11-05 Ca 2.07

Not all the duplicates can be explained by measuring at multiple depths, though, and occasionally the repeated values are significantly different. As an example, consider records for site 28970 (which corresponds to site code NF02YH0013, in Newfoundland, Canada).


In [10]:
# Get the duplicated values for this site
dup_df.query('station_id == 28970').head(10)


Out[10]:
station_id sample_date name value
125050 28970.0 1990-06-29 NO3-N 25.0000
125053 28970.0 1990-06-29 NO3-N 110.0000
125052 28970.0 1990-06-29 TOC 0.3125
125054 28970.0 1990-06-29 TOC 0.6400
125059 28970.0 1990-10-18 NO3-N 23.0000
125062 28970.0 1990-10-18 NO3-N 100.0000
125061 28970.0 1990-10-18 TOC 0.8750
125063 28970.0 1990-10-18 TOC 0.8960
125068 28970.0 1991-06-19 NO3-N 25.0000
379302 28970.0 1991-06-19 NO3-N 110.0000

There are lots of duplicated records here, and the values are not the same. Let's have a look at the database results for the first water sample in this series (the one from 29/06/1990), where we apparently have repeated measurements for NO3-N and TOC.


In [11]:
# Get all methods applied to sample(s) from 29/06/1990 at station 28970
sql = ("SELECT value_id, sample_id, method_id, value, flag1 "
       "FROM resa2.water_chemistry_values2 "
       "WHERE sample_id IN (SELECT water_sample_id FROM resa2.water_samples "
                           "WHERE station_id = 28970 "
                           "AND sample_date = DATE '1990-06-29')")

df = pd.read_sql(sql, engine)
df


Out[11]:
value_id sample_id method_id value flag1
0 7440578 328339 10249 79.0000 None
1 3621153 328339 10251 0.2200 None
2 3621154 328339 10253 2.5000 None
3 7440581 328339 10256 15.0000 None
4 3621155 328339 10258 0.1400 None
5 7440583 328339 10260 17.1000 None
6 3621156 328339 10261 0.2200 None
7 3621157 328339 10263 1.5000 None
8 7440586 328339 10265 25.0000 None
9 3621158 328339 10268 5.1000 None
10 3621159 328339 10271 1.0000 None
11 3621160 328339 10273 0.3125 None
12 7440589 328339 10274 100.0000 None
13 7440591 328339 10298 -6.0000 None
14 3621161 328339 10308 0.1100 None
15 3621162 328339 10311 -0.3000 <
16 7440590 328339 10823 0.5000 None

Note that all these records have the same sample_id, which means the problem is not related to samples being entered into the database more than once. Instead, the problem seems to be due to entering multiple methods for the same parameter: method_id = 10265 corresponds to NO3-N concentrations measured in $\mu g/l$, whereas method_id = 10308 refers to NO3-N in $mg/l$. As the data is extracted for the trends analysis, my code automatically converts all NO3-N measurements into $\mu g/l$, which is why we end up with duplicate values of 25 and 110 in the trends results (see above). The puzzle is where these values come from in the first place: $110 \; \mu g/l$ is, after all, a lot more than $25 \; \mu g/l$.
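
To illustrate, harmonising the two NO3-N rows above amounts to a method-specific multiplication, along the lines of the sketch below. The factor lookup here is hypothetical (the real code derives the conversions from the method and parameter definitions in the database), but the two method IDs and their units are as described above.

# Hypothetical lookup: factors converting each method's native units
# into RESA2's standard units for NO3-N (ug/l)
unit_factors = {10265: 1.,     # method 10265 reports NO3-N in ug/l
                10308: 1000.}  # method 10308 reports NO3-N in mg/l

no3_df = df[df['method_id'].isin(unit_factors.keys())].copy()
no3_df['value_ugl'] = no3_df['value'] * no3_df['method_id'].map(unit_factors)

Applied to the dataframe above, this gives values of 25.0 and 110.0 $\mu g/l$ - exactly the pair of "duplicates" seen earlier for this sample.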

Looking at the database log, the value of $0.110 \; mg/l$ was uploaded on 16/02/2006, whereas the value of $25 \; \mu g/l$ was added on 16/11/2015. The only raw data I can find on the network for this location is here:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\data delivery\CA\Couture\ICP Waters form for water chemistry_ATL_NF.xlsx

which gives the nitrate concentration as $25 \; \mu g/l$ (i.e. consistent with the most recent database upload). At present, I can't identify where the older values have come from, and I'm reluctant to delete them without further clarification. Ask Heleen about this, but for now I'll just stick with averaging duplicated values because I have no obvious way of choosing which one(s) are correct.

4. Data visualisation

I'd like to try producing a Google map to visualise the results of the statistical analysis. If I automate this as much as possible, it should be easy to re-run the code once I'm happy with the basic input data.

I'll start off by limiting the dataset so that we're only visualising the following parameters (and only in cases where there are more than 10 data values in the series):

  • ESO4 (μeq/l)
  • ESO4X (μeq/l)
  • ECl (μeq/l)
  • ESO4_ECl (μeq/l)
  • ECa_EMg (μeq/l)
  • ECaX_EMgX (μeq/l)
  • ENO3 (μeq/l)
  • ESO4_ECl_ENO3 (μeq/l)
  • TOC (mgC/l)
  • Al (mg/l)

In [19]:
# Extract subset of results to visualise
par_str = str(('ESO4', 'ESO4X', 'ECl', 
               'ESO4_ECl', 'ECa_EMg', 
               'ECaX_EMgX', 'ENO3', 'ESO4_ECl_ENO3', 
               'TOC', 'Al'))

vis_df = res_df.query("(par_id in %s) and (non_missing >= 10)" % par_str)

print 'Total number of stations to visualise:', len(vis_df['station_id'].unique())

vis_df.head()


Total number of stations to visualise: 428
Out[19]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
1 100.0 TOC 1986-2015 28 6.053929 6.000000 1.277101 221.0 4.351249 1.353642e-05 increasing 0.100000
2 100.0 ESO4 1986-2015 28 37.526786 31.250000 18.794670 -327.0 -6.446919 1.141465e-10 decreasing -2.001984
3 100.0 ECl 1986-2015 28 18.061224 17.142857 3.572260 -120.0 -2.369597 1.780750e-02 decreasing -0.184694
6 100.0 ENO3 1986-2015 28 2.418367 2.000000 1.526747 41.0 0.791033 4.289247e-01 no trend 0.028571
7 100.0 ESO4X 1986-2015 28 35.666480 29.778571 18.611860 -329.0 -6.481403 9.087331e-11 decreasing -1.945556

In [20]:
# Join in station details

# Basic station properties
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id in %s' 
       % str(tuple(vis_df['station_id'].unique())))

vis_stns = pd.read_sql(sql, engine)

# Get country for each station
sql = ('SELECT station_id, value '
       'FROM resa2.stations_par_values '
       'WHERE station_id in %s '
       'AND var_id = 261'
       % str(tuple(vis_df['station_id'].unique())))

co_df = pd.read_sql(sql, engine)

# Decode special characters from `windows-1252` encoding to unicode
vis_stns['station_name'] = vis_stns['station_name'].str.decode('windows-1252')

# Join
vis_stns = pd.merge(vis_stns, co_df, how='left',
                    on='station_id')

vis_stns.columns = ['station_id', 'station_code', 'station_name', 
                    'latitude', 'longitude', 'country']

# Join to stats output
vis_df = pd.merge(vis_df, vis_stns, how='left',
                  on='station_id')

# Reorder
vis_df = vis_df[['station_id', 'station_code', 'station_name', 'country',
                 'latitude', 'longitude', 'par_id', 'period', 'non_missing',
                 'mean', 'median', 'std_dev', 'mk_stat', 'norm_mk_stat',
                 'mk_p_val', 'trend', 'sen_slp']]

vis_df.head()


Out[20]:
station_id station_code station_name country latitude longitude par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
0 100.0 623-603 Breidlivatnet Norway 59.977669 10.152037 TOC 1986-2015 28 6.053929 6.000000 1.277101 221.0 4.351249 1.353642e-05 increasing 0.100000
1 100.0 623-603 Breidlivatnet Norway 59.977669 10.152037 ESO4 1986-2015 28 37.526786 31.250000 18.794670 -327.0 -6.446919 1.141465e-10 decreasing -2.001984
2 100.0 623-603 Breidlivatnet Norway 59.977669 10.152037 ECl 1986-2015 28 18.061224 17.142857 3.572260 -120.0 -2.369597 1.780750e-02 decreasing -0.184694
3 100.0 623-603 Breidlivatnet Norway 59.977669 10.152037 ENO3 1986-2015 28 2.418367 2.000000 1.526747 41.0 0.791033 4.289247e-01 no trend 0.028571
4 100.0 623-603 Breidlivatnet Norway 59.977669 10.152037 ESO4X 1986-2015 28 35.666480 29.778571 18.611860 -329.0 -6.481403 9.087331e-11 decreasing -1.945556

In section 3, I generated a very large number of plots. Only some of them are relevant, so to save storage space online I'll delete the ones that aren't relevant to the data in vis_df.


In [14]:
# Delete unnecessary plots
# Folder path
png_fold = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Trends_Plots'

# Paths to keep
keep_list = []
for rec in vis_df.itertuples():
    stn_id = rec.station_id
    par_id = rec.par_id
    period = rec.period
    
    keep_path = os.path.join(png_fold,
                             '%s_%s_%s.png' % (int(stn_id), par_id, period))
    keep_list.append(keep_path)

# Get a list of files in folder
search_path = os.path.join(png_fold, '*.png')
file_list = glob.glob(search_path)
                             
# Loop over files and delete where necessary
for file_path in file_list:
    if file_path not in keep_list:
        os.remove(file_path)

I've uploaded the remaining plots to a location online, which means I can link to them from the map visualisation. To do this, I need to insert the link to each plot as a new column in vis_df. I also want to add a column specifying the correct Google Fusion Tables marker symbol according to the nature of the trend and, finally, I'd like to capitalise the trend column to improve the formatting on my finished map. The whole thing then needs to be saved as a CSV.


In [21]:
# Capitalise 'trend' column
vis_df['trend'] = vis_df['trend'].str.capitalize()

# Join in text for GFT markers
mark_df = pd.DataFrame({'trend':['Increasing', 'No trend', 'Decreasing'],
                        'symbol':['small_red', 'small_yellow', 'small_green']})

vis_df = pd.merge(vis_df, mark_df, how='left',
                  on='trend')

# Build the link path
link_stem = r'http://www.googledrive.com/host/0BximeC_RweaebnFoa2VWcGtRWms/'

vis_df['link'] = (link_stem + vis_df['station_id'].map(int).map(str) + '_' +
                  vis_df['par_id'] + '_' + vis_df['period'] + '.png')

# Save 
out_csv = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Trends_Plots\vis_data.csv'
vis_df.to_csv(out_csv, index=False, encoding='windows-1252')

I've converted the output from all this into a Google Fusion Table, which can be found here.

5. Summary to date

This section summarises the work I've done so far with RESA2. The sub-headings can be compared to my RESA2 "to do" list, which can be found on the network here:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\JamesS\RESA2_To_Do_HWI.docx

5.1. Review of data

I've worked my way through all the ICP-Waters data in the database and have cleaned things up where possible. The notebooks here and here describe the progress to date and the main outstanding issues. In particular, I've deleted and reloaded a large proportion of the Czech water samples, many of which were associated with the wrong site codes. I've also identified a number of sites (mostly Swedish and Canadian) for which we have incomplete site metadata and/or very limited water chemistry measurements. Where data are missing, I've contacted the relevant Focal Centres to determine whether more complete information exists and can be made available.

There are still a number of datasets on the network which have not been added to the database. Most of these are associated with locations where the main data contact is no longer very responsive (e.g. Hungary or Slovakia) and where there are issues regarding the raw data that has been supplied. In addition, datasets for Russia and the Netherlands have been supplied relatively recently, but in a format that will require significant work before they can be added to the database. I haven't got around to this yet.

5.2. Review US sites

I've tidied up the US sites in the database based on the information in John's e-mails - full details are in this notebook. There are still a few (fairly minor) outstanding issues, but I've contacted John about these.

5.3. Land cover proportions and mean catchment elevation

I've added this information where possible and have also included a request for improved site metadata in the 2016 "call for data". I've already had responses from e.g. Italy and Switzerland, which have allowed me to substantially improve the station descriptions for these countries. Hopefully others will respond similarly later in the year.

For the Norwegian sites, I've done a bit of watershed processing to refine the catchment properties (see section 2 of this notebook), but there is still a substantial amount of work required before we have a complete set of metadata for the Norwegian catchments. Heleen thinks this may be worth doing at some point, but it's on hold for now due to lack of time and resources.

5.4. Correct Newfoundland data

This has been done, as described in section 5 of this notebook.

5.5. Add "country" as a metadata variable

Done.

5.6. Calculate distance from coast

Done - see section 4 of this notebook and section 1 of this notebook for details.

5.7. Create a map of the project sites

Done. The map can be accessed here and includes a simple visualisation of the results of the trend analyses (see below).

5.8. Run the new trend analysis

As described here and in this notebook, I've attempted to recode the entire trends analysis. This work is not finished yet (I need to extend the code to include a few more parameters), but the basic framework is complete and seems to be working OK. The code is easy to re-run and so, as long as Heleen, Don and John are happy with the basic output, it should be fairly straightforward to repeat and adapt the trends analysis as necessary. It would be useful to have a discussion on how the output could be modified and extended to make it more useful.

5.9. Send out 2016 "call for data"

I've e-mailed the Focal Centre contacts and have had several replies already (e.g. from Italy and Switzerland). All the new data I've received has been added to the database.

To check the progress of the 2016 call for any particular country, see the contacts sheet of the Excel file here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016\icpw_all_sites.xlsx

which keeps track of all the messages I've sent and the replies I've received to date.