In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')

TOC trends October 2016

Heleen would like updated results for the ICPW trends analysis at the beginning of October. Today is the last day of September and I'm committed to other projects for the next few weeks, so time is getting tight! The aim of this notebook is to get some more-or-less complete results that Heleen can work with while I'm away next week.

Note: There are still a number of significant issues with the database that I am unlikely to have time to fix in the near future. Although the ICPW dataset itself is pretty simple, cleaning up the existing system is not a small job, due to a long history of database structural development and, in particular, the way that RESA2 is coupled to various other systems. I'm not going to attempt to deal with these issues here, but I'll try to describe them below so I can come back in the future. The aim here is to try to produce something useful despite the outstanding known (and unknown?) issues.

1. ICPW database summary

Most of the tasks discussed with John, Don and Heleen back in May have been completed, as described here, here and here. Section 5 of the last of these notebooks provides a more detailed summary.

On 25/08/2016, John sent a couple of very helpful e-mails in response to my earlier questions regarding the US sites. Unfortunately, comparing John's spreadsheets to the information in RESA2 highlighted a number of further database-wide errors (not specific to the US sites), which are going to be tricky to fix. A notebook describing this work in detail is here, but the main implications are summarised in an e-mail to Heleen (sent 02/09/2016 at 11:42), which is reproduced in part below.

"...it is not possible to reliably extract complete water chemistry time series from RESA2, even for just a single chemical parameter at a single site. This is because, if you choose a site that appears in both the core ICPW programme and in the wider trends analysis (which is most of them), you will find two time series in the database: one associated with 'ICPWaters' and another linked to 'TOC_TRENDS'. These two series will likely have different start and end points and different periods of missing data. There will also be a lot of overlap, and most of the overlapping values will agree, but in a few places they will be substantially different. What's more, at present the database will often report significantly different site properties for the two time series (i.e. different station names, geographic co-ordinates and catchment properties), despite the fact that the samples actually all come from a single location. All this means that if you want to look at a complete water chemistry time series for any site in ICP Waters, the only way to do it at the moment is to extract both series, manually merge them (e.g. in Excel) and then try to figure out which set of site properties is correct.

[...]

At present it is possible for someone using RESA2 to extract two different time series and two sets of site properties for a single location. This is pretty confusing (as we found out back in May), and it also somewhat defeats the point of having a database."

Other outstanding issues include the large number of Swedish sites with no data, problems with the Czech data (in an e-mail received 08/09/2016, Vladimir suggested just deleting and reloading everything) and the poor availability of site metadata for many locations.

Having discussed these issues with Heleen, I've decided to try the following:

  • Tidy up the US sites based on information provided by John. I'm not going to attempt to merge any datasets at this stage, but I can clean up the site names and add the original USEPA site codes to reduce confusion.

  • Look at the Czech data sent by Vladimir and decide whether it can and should replace the values currently in the database.

  • Run the trends analysis using whatever data is currently associated with the TOC_TRENDS projects. In principle, these datasets were gathered together separately from ICPW for the 2015 reanalysis, which is why they're associated with separate projects in the database. As described in the notebook linked above, the values in these series sometimes do not agree with those reported for the core ICPW projects, but I'm going to ignore this for now as reconciling the differences will take a long time.


In [2]:
# Import custom functions and connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')

resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

engine, conn = resa2_basic.connect_to_resa2()

2. Tidy up US sites

Based on the results in the previous notebook, I'd like to go through all the US sites in RESA2 and modify the site properties to match the values in John's spreadsheet (which I'm taking to be definitive). In particular, I want to correct the station names and geographic co-ordinates, as well as appending the original USEPA site codes to the station metadata.

2.1. Core ICPW sites

We'll start by correcting the core ICPW sites. John's spreadsheet identified 95 sites that should be associated with this project and, following the work in the previous notebook, this now agrees with what's in RESA2. Four of the core ICPW sites in John's spreadsheet are marked NA in the station_code column. These sites do now have codes in RESA2, so I've created a new version of John's spreadsheet (U.S.Site.Reconciliation.August.2016_jes.xlsx) with these codes added. The first step is therefore to check that there is a direct match between the codes for the 95 core sites in RESA2 and the 95 sites listed in John's spreadsheet.


In [3]:
# Read John's spreadsheet
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
          r'\Replies\usa\U.S.Site.Reconciliation.August.2016_jes.xlsx')
j_df = pd.read_excel(in_xls, sheetname='Sheet1')

# Get a subset of columns from John's spreadsheet
us_df = j_df[['Station Code', 'Station name', 'Suggested Name',
              'Active', 'INCLUDE IN ICP WATERS DATABASE', 
              'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',
              'NFC_SITENAME', 'Latitude', 'Longitude', 'Altitude (m)']]

# Rename columns
us_df.columns = ['station_code', 'station_name', 'suggested_name',
                 'active', 'core', 'trend', 'nfc_code', 'nfc_name', 
                 'latitude', 'longitude', 'elevation']

# Just the core sites
core_df = us_df.query('core == "YES"')
core_df.head()


Out[3]:
station_code station_name suggested_name active core trend nfc_code nfc_name latitude longitude elevation
0 US24 New York, Catskill Mnt., Rondout Creek Rondout Creek, Catskills YES YES YES 01364959 RONDOUT CREEK 41.9367 -74.3764 503.0
1 US25 W Br Neversink R At Winnisook, Catskills West Branch Neversink River at Winnisook, Cats... YES YES YES 01434021 W BR NEVERSINK R AT WINNISOOK 41.9725 -74.4483 816.0
2 US26 Biscuit Brook, Catskills Biscuit Brook, Catskills YES YES YES 01434025 BISCUIT BROOK 42.0111 -74.4147 634.0
3 US23 New York, Catskill Mnt., E. Branch Neversink, ... East Branch Neversink River, Catskills YES YES YES 0143400680 EAST BRANCH NEVERSINK, HEADWAT 41.9869 -74.5031 652.0
4 US27 Little Hope Pond, Adirondacks Little Hope Pond, Adirondacks YES YES NO, road salt contamination 020058O LITTLE HOPE POND 44.5158 -74.1252 521.0

In [37]:
# Get station codes associated with 'ICP Waters US' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id in (SELECT station_id '
                             'FROM resa2.projects_stations '
                             'WHERE project_id = 1679)')
r2_df = pd.read_sql_query(sql, engine)
r2_df.head()


Out[37]:
station_id station_code station_name latitude longitude
0 23659 US24 New York, Catskill Mnt., Rondout Creek 41.9600 -74.4500
1 23658 US23 New York, Catskill Mnt., E. Branch Neversink, ... 43.1800 -74.5000
2 23660 US25 W Br Neversink R At Winnisook, Catskills 42.0111 -74.4147
3 23661 US26 Biscuit Brook, Catskills 41.9869 -74.5031
4 23662 US27 Little Hope Pond, Adirondacks 44.5158 -74.1252

In [30]:
# Join
df = r2_df.merge(core_df, on='station_code', how='outer',
                 suffixes=('_r', '_j'))

if len(df) == 95:
    print 'All rows match.'
else:
    print 'Row count mismatch: %s rows after join.' % len(df)
    
df.head()


All rows match.
Out[30]:
station_id station_code station_name_r latitude_r longitude_r station_name_j suggested_name active core trend nfc_code nfc_name latitude_j longitude_j elevation
0 23659 US24 New York, Catskill Mnt., Rondout Creek 41.9600 -74.4500 New York, Catskill Mnt., Rondout Creek Rondout Creek, Catskills YES YES YES 01364959 RONDOUT CREEK 41.9367 -74.3764 503.0
1 23658 US23 New York, Catskill Mnt., E. Branch Neversink, ... 43.1800 -74.5000 New York, Catskill Mnt., E. Branch Neversink, ... East Branch Neversink River, Catskills YES YES YES 0143400680 EAST BRANCH NEVERSINK, HEADWAT 41.9869 -74.5031 652.0
2 23660 US25 W Br Neversink R At Winnisook, Catskills 42.0111 -74.4147 W Br Neversink R At Winnisook, Catskills West Branch Neversink River at Winnisook, Cats... YES YES YES 01434021 W BR NEVERSINK R AT WINNISOOK 41.9725 -74.4483 816.0
3 23661 US26 Biscuit Brook, Catskills 41.9869 -74.5031 Biscuit Brook, Catskills Biscuit Brook, Catskills YES YES YES 01434025 BISCUIT BROOK 42.0111 -74.4147 634.0
4 23662 US27 Little Hope Pond, Adirondacks 44.5158 -74.1252 Little Hope Pond, Adirondacks Little Hope Pond, Adirondacks YES YES NO, road salt contamination 020058O LITTLE HOPE POND 44.5158 -74.1252 521.0

The code above shows that all the sites have been matched correctly, so I can now loop over each site in the ICP Waters US project, updating the details in the stations table using the information in John's spreadsheet.

I also want to add the original National Focal Centre site code as an additional attribute (as John suggested in his e-mails). The easiest way to do this is to create a new entry (NFC_Code) in the STATION_PARAMETER_DEFINITIONS table.


In [48]:
# Loop over sites
for idx, row in df.iterrows():
    # Get site properties
    stn_id = row['station_id']
    name = row['suggested_name']
    lat = row['latitude_j']
    lon = row['longitude_j']
    elev = row['elevation']
    nfc = row['nfc_code']

    # Update stations table
    sql = ("UPDATE resa2.stations "
           "SET station_name = '%s', "
               "latitude = %s, "
               "longitude = %s, "
               "altitude = %s "
           "WHERE station_id = %s" 
           % (name, lat, lon, elev, stn_id))

    result = conn.execute(sql)

    # Update stations_par_values table with NFC code
    sql = ("INSERT INTO resa2.stations_par_values "
           "(station_id, var_id, value, entered_by, entered_date) "
           "VALUES (%s, 321, '%s', 'JES', TO_DATE('2016-09-30', 'YYYY-MM-DD'))"
           % (stn_id, nfc))

    result = conn.execute(sql)

The next step is to correct the entries for the trends sites. Note that if the database were properly normalised this step wouldn't be necessary, as the 76 trends sites are a sub-set of the 95 core ICPW sites, so the steps described above should cover all relevant sites. However, due to the duplication of sites in the database, it is necessary to do this cleaning twice.

Matching the trends sites turns out to be a bit more difficult, because of inconsistencies in the X15: prefixes. The first step is to manually add Newbert Pond, Maine (station_code = US82) to the ICPW_TOCTRENDS_2015_US_LTM project, as mentioned at the end of the previous notebook. This should mean that we have 76 sites associated with the trends project, as specified in John's spreadsheet.


In [4]:
# Just the trend sites
trend_df = us_df.query('trend == "YES"')
trend_df.head()


Out[4]:
station_code station_name suggested_name active core trend nfc_code nfc_name latitude longitude elevation
0 US24 New York, Catskill Mnt., Rondout Creek Rondout Creek, Catskills YES YES YES 01364959 RONDOUT CREEK 41.9367 -74.3764 503.0
1 US25 W Br Neversink R At Winnisook, Catskills West Branch Neversink River at Winnisook, Cats... YES YES YES 01434021 W BR NEVERSINK R AT WINNISOOK 41.9725 -74.4483 816.0
2 US26 Biscuit Brook, Catskills Biscuit Brook, Catskills YES YES YES 01434025 BISCUIT BROOK 42.0111 -74.4147 634.0
3 US23 New York, Catskill Mnt., E. Branch Neversink, ... East Branch Neversink River, Catskills YES YES YES 0143400680 EAST BRANCH NEVERSINK, HEADWAT 41.9869 -74.5031 652.0
5 US28 Big Hope Pond, Adirondacks Big Hope Pond, Adirondacks YES YES YES 020059O BIG HOPE POND 44.5119 -74.1250 522.0

In [3]:
# Get station codes associated with 'ICPW_TOCTRENDS_2015_US_LTM' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id in (SELECT station_id '
                             'FROM resa2.projects_stations '
                             'WHERE project_id = 3870)')

r2_df = pd.read_sql_query(sql, engine)

print 'Number of sites:', len(r2_df)

r2_df.head()


Number of sites: 76
Out[3]:
station_id station_code station_name latitude longitude
0 36983 X15:020059O Big Hope Pond, Adirondacks 44.5119 -74.1250
1 36984 X15:020138O East Copperas Pond, Adirondacks 44.3119 -74.3722
2 36985 X15:020188E Sunday Pond, Adirondacks 44.3447 -74.3005
3 36986 X15:020197E Sochia Pond, Adirondacks 44.3522 -74.2947
4 36987 X15:020265O Marcy Dam Pond, Adirondacks 44.1589 -73.9530

The next question is how well we can match these sites to John's spreadsheet. Based on the analysis in the previous notebook, the answer is, "not very well", because leading and trailing zeros in the original site codes have been truncated (presumably accidentally, e.g. in Excel) prior to the X15: prefix being added. This isn't necessarily a major problem - the station codes used in RESA2 are mostly irrelevant - but I do need to somehow match them to John's spreadsheet and then update the station properties (preferably also adding the original site codes supplied by John, so we don't have to go through all this again later). The code below takes a slightly "heuristic" approach to finding matches.


In [121]:
# Loop over rows from John
for idx, row in trend_df.iterrows():
    # Get site properties
    nfc_cd = row['nfc_code']
    name = row['suggested_name']
    lat = row['latitude']
    lon = row['longitude']
    elev = row['elevation']
        
    # Attempt to find match. Need to add 'X15:' and allow for variants
    q_res = r2_df[(r2_df['station_code']=='X15:%s' % nfc_cd) |
                  (r2_df['station_code']=='X15:%s' % nfc_cd[1:]) |
                  (r2_df['station_code']=='X15:%s' % nfc_cd[:-1])]
       
    if len(q_res) == 1:
        # Single match found. Get stn_id
        stn_id = q_res.iloc[0]['station_id']

        # Update stations table
        sql = ("UPDATE resa2.stations "
               "SET station_name = '%s', "
                   "latitude = %s, "
                   "longitude = %s, "
                   "altitude = %s "
               "WHERE station_id = %s" 
               % (name, lat, lon, elev, stn_id))

        result = conn.execute(sql)


        # Check whether there's already an entry for this site
        # in stations_par_values table
        sql = ('SELECT * FROM resa2.stations_par_values '
               'WHERE station_id = %s '
               'AND var_id = 321'
               % stn_id)
        
        df = pd.read_sql_query(sql, engine)
        
        if len(df) < 1:
            # Update stations_par_values table with NFC code        
            sql = ("INSERT INTO resa2.stations_par_values "
                   "(station_id, var_id, value, entered_by, entered_date) "
                   "VALUES (%s, 321, '%s', 'JES', TO_DATE('2016-09-30', 'YYYY-MM-DD'))"
                   % (stn_id, nfc_cd))

            result = conn.execute(sql)
       
    else:
        # Can't get good match
        print "Can't match %s." % nfc_cd


Can't match 1E1-134E.
Can't match ME-9998E.

This code manages to find unique matches for all but two of the sites, which is a good start. Looking at the site codes for the two exceptions in John's spreadsheet, it seems as though they were previously only associated with the core ICPW project and not the broader trends analysis. They were therefore not duplicated when the TOC Trends projects were created and instead appear in the database only once, using station codes US74 and US82, respectively (rather than any of the X15: stuff).

2.3. Testing

With a bit of luck, I've finally managed to sort out the basic details for the US sites. Let's check.


In [4]:
# Get the NFC site codes
sql = ('SELECT station_id, value AS nfc_code '
       'FROM resa2.stations_par_values '
       'WHERE var_id =321')
nfc_df = pd.read_sql_query(sql, engine)

# Get station codes associated with 'ICP Waters US' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude, altitude '
       'FROM resa2.stations '
       'WHERE station_id in (SELECT station_id '
                             'FROM resa2.projects_stations '
                             'WHERE project_id = 1679)')
core_df = pd.read_sql_query(sql, engine)

# Get station codes associated with 'ICPW_TOCTRENDS_2015_US_LTM' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude, altitude '
       'FROM resa2.stations '
       'WHERE station_id in (SELECT station_id '
                             'FROM resa2.projects_stations '
                             'WHERE project_id = 3870)')
trend_df = pd.read_sql_query(sql, engine)

# Join in original site codes
core_df = core_df.merge(nfc_df, on='station_id', how='left')
trend_df = trend_df.merge(nfc_df, on='station_id', how='left')

print 'Sites in core ICPW project:', len(core_df)
print 'Sites in trends project:   ', len(trend_df)


Sites in core ICPW project: 95
Sites in trends project:    76

Regardless of the station codes and IDs in RESA2, I should now be able to make sense of the US data using the actual USEPA codes. For example, the sites in trend_df should be a true sub-set of those in core_df, and all the site properties should agree.


In [9]:
# Inner join dfs
df = trend_df.merge(core_df, on='nfc_code', how='inner', 
                    suffixes=('_t', '_c'))

# Testing
assert len(df) == 76, 'Incorrect number of sites.'

for item in ['station_name', 'latitude', 'longitude', 'altitude']:
    assert (df[item + '_t'] == df[item + '_c']).all(), 'Mismatch in %ss.' % item

print 'Check complete. All properties match.'


Check complete. All properties match.

Great - I think the US LTM sites are now more-or-less sorted. Only another 20 countries to go! Note that, with the changes made above, it is now possible to extract the NFC station codes from RESA2 in the same way as for the other station properties (i.e. using the Additional station data tab). There are still some issues to be aware of, though. In particular, if you choose to export station properties from RESA2 to Excel, the RESA2 code will convert any all-numeric NFC codes to numbers, so codes such as 013425 lose their leading zeros and become 13425. This is not a problem with the database (the values in Oracle are correct) - it is a problem with the RESA2 code that copies results from the database into Excel. I'm not going to delve into this at the moment, as I'm keen to avoid becoming too involved with the RESA2 application itself. As a workaround, it is safer to export from RESA2 as CSV and then import the CSV into Excel, taking care to set the column type for the nfc_code field to Text during the import process.

3. Czech data

Vladimir has suggested deleting all the Czech data and then uploading it again from scratch. I'm reluctant to delete more than a decade's worth of data, but an alternative option, more consistent with what's been done before, would be to rename the existing Czech site codes and shift them into an EXCLUDED project. I can then upload the new Czech data using the same site codes as previously, which will hopefully avoid the issues created when the trend data was similarly uploaded, but using modified (i.e. X15:) prefixes.

Firstly, some points to note:

  • The "old" database has Czech data for 9 sites. Two of these, Lysina (CZ07) and Pluhuv Bor (CZ09), have higher resolution data than the rest. See Section 3 of this notebook for full details.

  • In his latest data submission, Vladimir has only supplied monthly resolution data for sites CZ01 to CZ08 inclusive (i.e. excluding Pluhuv Bor). This is exactly what's required for the core ICPW dataset, but we may wish to include some of the old (weekly) data from Lysina and Pluhuv Bor in the trends analysis (assuming it's considered good enough to use).

For the moment, I propose shifting all of the existing Czech data (sites CZ01 to CZ09) into a new project called ICPWaters CZ Excl. I will then upload the new data for sites CZ01 to CZ08 and associate these records with both the core ICPW project and the trend analysis work. Both projects will therefore access exactly the same (monthly resolution) data, which is more consistent than the data structure used at present. The downsides are that the trends project will then make use of lower resolution (monthly rather than weekly) data for Lysina (CZ07) and have no data at all for Pluhuv Bor (CZ09). Given that we are estimating trends from annual averages anyway, I don't think the differences in temporal resolution are a problem (in fact, using monthly data is arguably better as it's more consistent with our statistical analyses elsewhere). It is also possible to include Pluhuv Bor in the trends project if we want to, but Jakub has previously stated that it is a very well-buffered catchment that does not suffer from acidification (see e-mail received 29/06/2016 at 10:58), so I'm not sure it's appropriate anyway. Check with Heleen.

3.1. Restructure Czech projects

The first step is to create a new project called ICPWaters CZ Excl. I can then rename all the existing Czech site codes (CZ01 etc.) by adding the suffix _Old, followed by shifting them over to the newly created project. The second step is to create 8 new sites (with identical properties to the old ones), assigning them the same site codes as used previously (CZ01 etc.). These new sites have no data associated with them, so I should be able to upload the revised data from Vladimir without creating any conflicts (I hope). All of this is most easily done manually using Access.

Note: This process involves creating duplicate sites and is therefore superficially similar to the duplication described above, which has already caused lots of confusion. However, these problems have primarily been caused because we have two sets of sites (core and trends), which represent the same locations but which often have different site properties. More importantly, both of these projects are considered "active", and data is periodically appended to update them. This is what causes the problems: we have two supposedly identical datasets evolving in parallel, and over time differences emerge that are difficult to correct.

For the Czech sites, I'm moving all the old data into an EXCLUDED project, marking the site names as OLD, and also adding a description to the STATIONS table saying they are Associated with data supplied prior to August 2016. In principle, I'd be happy to delete all of this data entirely (which would remove any concerns about duplication and database normalisation etc.), but I don't want to lose any information if I can avoid it. Besides, deleting records from RESA2 is not straightforward, due to all the other database interactions (a topic I'd like to avoid getting involved in for the moment). My hope is that the changes I'm making here will not cause further confusion, because the sites with the suffix _Old will be discontinued completely i.e. no new data will be associated with them and they won't be used in subsequent projects. This should avoid the messy situation that we're now in with sites from other countries.

3.2. Upload new data

With the new project structure in place, I can use upload_icpw_template.ipynb to load the latest data from Vladimir into the database. This seems to have worked successfully, but for some reason the available parameters for the new sites are not showing up properly in the RESA2 application. They are there, and the values appear to be correct, but the Refresh parameter list button is not working as it used to. However, if the STANDARD button is pressed to select the routine parameters of interest, everything works as expected and the data can be exported to Excel as usual. The bit that's missing is that the user is no longer presented with the option of selecting custom parameters. I'm not sure what's happening here - another of the mysteries of RESA2! I'll have to ask Tore where his code pulls this information from and modify the database accordingly, but I'm not going to worry about this for now, as all my code will interact directly with the underlying Oracle database itself.

3.3. Data checking

To check that my changes have worked as expected, I want to compare the data now in the database with what's in Vladimir's spreadsheet. To do this, I've manually extracted time series for the 8 Czech sites from RESA2 and saved them in check_czech.xlsx. The code below plots these values from RESA2 against the raw values in Vladimir's spreadsheet, which hopefully should agree.


In [29]:
# Read data
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
           r'\Replies\czech_republic\check_czech.xlsx')
r2_df = pd.read_excel(in_xlsx, sheetname='RESA')
vl_df = pd.read_excel(in_xlsx, sheetname='Orig')

In [35]:
# Define params of interest
params = ['pH', 'Ca', 'Mg', 'Na', 'K', 'Cl', 'SO4', 'ALK', 'TOC']

fig, axes = plt.subplots(nrows=9, ncols=8, figsize=(15, 20))

# Loop over data
for row, param in enumerate(params):
    # Get data
    df1 = r2_df[['Code', 'Date', param]]
    df2 = vl_df[['Code', 'Date', param]]
    
    # Pivot
    df1 = df1.pivot(index='Date', columns='Code', values=param)
    df2 = df2.pivot(index='Date', columns='Code', values=param)
    
    # Join
    df = df1.merge(df2, how='outer', 
                   left_index=True, right_index=True)
    
    for site in range(1, 9):
        # Get data for this site
        s_df = df[['CZ%02d_RESA' % site, 'CZ%02d_Orig' % site]].copy()
        s_df.dropna(how='any', inplace=True)
        
        # Plot
        s_df.plot(ax=axes[row, site-1], legend=False)
        axes[row, site-1].set_title('CZ%02d %s' % (site, param))

plt.tight_layout()        
plt.show()



This looks promising: there is only one colour visible on these plots, but two lines are being plotted, so I assume the results from RESA are overlapping exactly with those from Vladimir's spreadsheet.

The final step for this notebook is to update my trends code and re-run the analysis. Note that there are still some known issues with the data, so these results should not be taken as definitive, but hopefully they are a step in the right direction.

My previous trends notebook left a number of loose ends, which need tidying up here:

  • My current code averages samples taken from 0 and 0.5 m depth. In his e-mail from 21/07/2016 at 14:54, Don recommended only using samples from 0.5 m. Changing this will require some careful coding, as many of the samples seem to have depth = 0 as a kind of default, so if I select only those records where depth = 0.5, I'll end up missing a large chunk of the data. My algorithm needs to first select all the surface water data, and then choose the 0.5 m sample only in cases where there are duplicates (i.e. results for both 0.5 m and 0.0 m).

  • My original code used reference ratios for sea-salt corrections which I back-calculated from RESA2. In an e-mail received 16/08/2016 at 14:04 Heleen provided the true values used for the original analysis. I think these agree with my back-calculated values, but this needs checking.

  • My code does not currently calculate trends for the correct parameters. Based on Heleen's e-mail (see above), the main quantities of interest are:

    • ESO4 ($μeq/l$)
    • ESO4X ($μeq/l$)
    • ECl ($μeq/l$)
    • ESO4_ECl ($μeq/l$)
    • ECa_EMg ($μeq/l$)
    • ECaX_EMgX ($μeq/l$)
    • ENO3 ($μeq/l$)
    • TOC ($mgC/l$)
    • Al ($mg/l$)
    • ANC ($= Ca + Mg + K + Na + NH_4 - Cl - SO_4 - NO_3$, all in $μeq/l$)
    • ALK ($μeq/l$)
    • HPLUS ($μeq/l$)

  • Deal with duplicates. The previous notebook highlighted some other duplicates for which I couldn't find an explanation (see section 3.2.3). Suzanne hasn't responded to my e-mails, so I'm not in a position to correct these issues at this stage. For now, I propose to take the most recently uploaded of the duplicate values, as these are usually the ones for which I can find a "paper-trail".

  • Choose a method for dealing with values at the detection limit. There are no hard-and-fast rules here and the choice is unlikely to dramatically influence trend results calculated using non-parametric statistics (i.e. Sen's slope calculated from annual medians). At present, my code simply substitutes the detection limit value, which is the easiest and most transparent approach. I can change this if there are strong arguments for using an alternative method, though.

4.1.1. Depths

Some more careful checking of the database shows that this is actually a very minor problem. Records with measurements from both 0 m and 0.5 m only occur for a few of the Norwegian sites back in the 1970s. For these measurements, the chemistry values are the same regardless of which depth is chosen, so the problem is actually negligible.

The code below finds all of the duplicated depths associated with the ICPW trends sites. The issue is so limited that I don't think it needs further consideration.


In [5]:
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

sql = ('SELECT station_id, station_code '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT UNIQUE(station_id) '
                            'FROM resa2.projects_stations '
                            'WHERE project_id IN (SELECT project_id '
                                                 'FROM resa2.projects '
                                                 'WHERE project_name IN %s))'
                                                 % str(tuple(proj_list)))    
stn_df = pd.read_sql(sql, engine)
    
    
sql = ("SELECT water_sample_id, station_id, sample_date "
       "FROM resa2.water_samples "
       "WHERE station_id IN %s "
       "AND depth1 <= 1 "
       "AND depth2 <= 1" % str(tuple(stn_df['station_id'].values)))

df = pd.read_sql(sql, engine)

print 'Number of duplicated records:', df.duplicated(subset=['station_id', 'sample_date']).sum()
df[df.duplicated(subset=['station_id', 'sample_date'], keep=False)]


Number of duplicated records: 12
Out[5]:
water_sample_id station_id sample_date
117 834 103 1974-10-29
118 136691 103 1974-10-29
153 424 104 1974-11-05
154 136693 104 1974-11-05
225 743 107 1974-10-29
226 136695 107 1974-10-29
3600 791 115 1974-10-16
3601 136697 115 1974-10-16
3763 1266 120 1974-10-13
3764 19874 120 1974-10-13
4157 1285 135 1974-10-13
4158 136699 135 1974-10-13
4912 1566 158 1975-03-19
4913 20512 158 1975-03-19
4944 928 161 1974-10-12
4945 136703 161 1974-10-12
4979 1417 162 1974-10-18
4980 20177 162 1974-10-18
5469 1714 179 1975-03-28
5470 20847 179 1975-03-28
5502 915 180 1974-10-12
5503 136705 180 1974-10-12
5623 487 183 1974-10-31
5624 136707 183 1974-10-31

4.1.2. Reference ratios

These have been checked and the values in my code agree with those in Heleen's e-mail.

4.1.3. Additional parameters

I have modified the code to include ANC, calculated as

$$ANC = Ca + Mg + K + Na + NH_4 - Cl - SO_4 - NO_3$$

where all values are in $\mu eq/l$ (as per Heleen's e-mail - see above). However, note that not all sites have complete data for all these chemical parameters, e.g. the Newfoundland sites do not report NH4. I believe NH4 is usually a fairly small component of ANC, so in my code I've decided to assume $NH_4 = 0$ unless it's explicitly specified. This means that I can still calculate ANC for the Newfoundland sites (which would otherwise all return "NoData"). Is this reasonable? Are there any other chemical species that can safely be ignored when they're not reported? Check this with Heleen.
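The ANC calculation with the NH4 assumption can be sketched as follows. This is illustrative only (the real code works column-wise on a pandas dataframe); `calc_anc` and the `row` dict of equivalent concentrations are hypothetical names, and the concentrations are assumed to be already converted to $\mu eq/l$.

```python
# Minimal sketch of the ANC calculation, not the production code.
# 'row' is a hypothetical dict of ion concentrations in ueq/l.
def calc_anc(row):
    # Treat missing NH4 as zero (e.g. the Newfoundland sites), so ANC
    # can still be calculated instead of returning "NoData"
    nh4 = row.get('ENH4', 0.)
    if nh4 != nh4:  # NaN also counts as missing
        nh4 = 0.
    return (row['ECa'] + row['EMg'] + row['EK'] + row['ENa'] + nh4
            - row['ECl'] - row['ESO4'] - row['ENO3'])

# Example: a site that does not report NH4
print(calc_anc({'ECa': 50., 'EMg': 20., 'EK': 5., 'ENa': 40.,
                'ECl': 45., 'ESO4': 35., 'ENO3': 5.}))  # 30.0 ueq/l
```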

Including alkalinity is not straightforward, as there are many different methods and some of them are not fully described in the database. To make use of the alkalinity data I need to convert everything to a common scale and units (preferably in $\mu eq/l$). This can be done for some of the methods using the metadata in RESA2, but for others this is not possible. Having queried this with Heleen, she has suggested ignoring alkalinity for now (see e-mail received 17/10/2016 at 13:19), but this is something to return to later, possibly with help from Øyvind Garmo regarding how to align the various methods. For future reference, the methods in question are listed here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Data_Summaries\alkalin_method_issues.xlsx

4.1.4. Duplicates

Where duplicates are reported which cannot be explained by different sampling depths (e.g. 0 m and 0.5 m, as above), my code now selects the value that has been added to the database most recently. There is relatively little justification for this decision: I haven't been able to get to the bottom of these issues and this step is really just a fudge (my code previously averaged these measurements, which is also dodgy). The main reason for choosing the most recent database addition is that I can generally find these values in the raw data, whereas the older measurements aren't so obvious. However, it's possible that this is more a reflection of my (still limited) knowledge of the data on the NIVA network, rather than because the more recent data is "more correct".
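The "keep the most recent database addition" rule can be sketched in pandas along these lines (made-up numbers; the real resolution happens inside the trends code):

```python
import pandas as pd

# Illustrative sketch: where a station-date-parameter combination is
# duplicated, keep the row added to the database most recently,
# judged by 'entered_date'
dup = pd.DataFrame({'station_id':   [103, 103],
                    'sample_date':  ['1974-10-29', '1974-10-29'],
                    'par_id':       ['TOC', 'TOC'],
                    'value':        [4.2, 4.5],
                    'entered_date': ['1990-01-01', '2015-06-01']})

keep = (dup.sort_values('entered_date')
           .drop_duplicates(subset=['station_id', 'sample_date', 'par_id'],
                            keep='last'))

print(keep['value'].iloc[0])  # the 2015 entry wins: 4.5
```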

4.1.5. Detection limits

Heleen is happy to replace values below the detection limit with the detection limit itself, which is what my code currently does. See e-mail received 07/10/2016 at 15:18.
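The substitution itself is simple. A minimal sketch with hypothetical data, assuming below-limit rows carry a '<' in `flag1` and a `detection_limit` value (column names as in `water_chemistry_values2`):

```python
import pandas as pd

# Hypothetical example of the detection-limit substitution: rows flagged
# '<' are below the limit, so the value is replaced with the limit itself
wc = pd.DataFrame({'value':           [0.005, 5.41, 0.001],
                   'flag1':           ['<', None, '<'],
                   'detection_limit': [0.01, None, 0.01]})

below = wc['flag1'] == '<'
wc.loc[below, 'value'] = wc.loc[below, 'detection_limit']

print(wc['value'].tolist())  # [0.01, 5.41, 0.01]
```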

This section tests the new code by applying it to the data for all years (i.e. the full dataset for each site).


In [64]:
# Import code for trends analysis
resa2_trends_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
                     r'\Python\icpw\toc_trends_analysis.py')

resa2_trends = imp.load_source('toc_trends_analysis', resa2_trends_path)

The list of projects to consider has been previously agreed with Heleen (see section 3 of this notebook).


In [65]:
# User input
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

# Output paths
plot_fold = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
             r'\Results\Trends_Plots')
res_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Results\res.csv')
dup_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Results\dup.csv')
nd_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\nd.csv')

In [5]:
# Run analysis 
res_df, dup_df, nd_df = resa2_trends.run_trend_analysis(proj_list, engine,
                                                        st_yr=None, end_yr=None,
                                                        plot=True, fold=plot_fold)

# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']

# Write output
res_df.to_csv(res_csv, index=False)
dup_df.to_csv(dup_csv, index=False)
nd_df.to_csv(nd_csv, index=False)

res_df.head(14).sort_values(by='par_id')


Extracting data from RESA2...
    The database contains duplicate values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    Some stations have no relevant data in the period specified. Their IDs are returned in a separate dataframe.

    Done.

Converting units and applying sea-salt correction...
    Done.

Calculating statistics...
    Data series for Al at site 101 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 102 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 103 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 104 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 107 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 109 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 112 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 115 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 118 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 119 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 120 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 121 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 122 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 123 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 128 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 132 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 134 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 135 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 144 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 146 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 147 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 150 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 156 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 158 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 161 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 162 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 163 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 166 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 168 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 170 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 173 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 176 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 179 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 180 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 181 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 182 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 183 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 185 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 192 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 193 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 196 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 12081 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 23468 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 23546 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 36547 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 36560 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36733 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36739 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36750 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36753 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36793 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36797 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 37063 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Done.

Finished.
Out[5]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
10 100 ANC 1986-2015 28 2.133307 3.741420 18.672454 322.0 6.341843 2.270331e-10 increasing 2.041432
0 100 Al 1986-2015 1 276.433655 276.433655 NaN NaN NaN NaN NaN NaN
0 101 Al 1986-2015 3 134.411234 133.233703 15.034625 1.0 0.000000 1.000000e+00 no trend 2.395185
9 100 ECaX_EMgX 1986-2015 28 25.110187 24.922857 4.404364 -158.0 -3.101774 1.923650e-03 decreasing -0.296140
8 100 ECa_EMg 1986-2015 28 29.318452 29.583333 4.774226 -158.0 -3.102985 1.915794e-03 decreasing -0.333333
4 100 ECl 1986-2015 28 18.061224 17.142857 3.572260 -120.0 -2.369597 1.780750e-02 decreasing -0.184694
2 100 EH 1986-2015 28 11.754885 10.368577 5.065416 -214.0 -4.213075 2.519175e-05 decreasing -0.381853
2 101 EH 1986-2015 30 2.038329 1.318257 1.476923 -302.0 -5.376715 7.585719e-08 decreasing -0.104936
5 100 ENO3 1986-2015 28 2.418367 2.000000 1.526747 41.0 0.791033 4.289247e-01 no trend 0.028571
3 100 ESO4 1986-2015 28 37.526786 31.250000 18.794670 -327.0 -6.446919 1.141465e-10 decreasing -2.001984
6 100 ESO4X 1986-2015 28 35.666480 29.778571 18.611860 -329.0 -6.481403 9.087331e-11 decreasing -1.945556
7 100 ESO4_ECl 1986-2015 28 55.588010 46.943452 20.825615 -305.0 -6.007154 1.888077e-09 decreasing -2.122321
1 100 TOC 1986-2015 28 6.053929 6.000000 1.277101 221.0 4.351249 1.353642e-05 increasing 0.100000
1 101 TOC 1986-2015 30 3.462000 3.500000 0.724533 218.0 3.882850 1.032394e-04 increasing 0.059565

6. Data issues

The code seems to be working correctly, but I still have concerns about some of the values in RESA. Looking at the output (res.csv), the following issues are fairly obvious:

  • A number of the Swedish stations have unrealistically high EH+ (i.e. very low pH). They can be identified with the following SQL:

     SELECT * FROM RESA2.STATIONS
     WHERE STATION_ID IN (36797, 36733, 36753, 36793, 36588, 36660, 36578, 
                          36750, 36670, 36825, 36584, 36680, 36788, 36575,
                          36813, 36636, 36690, 36592, 36675, 36826, 36711,
                          36723, 36731, 36739);
    
    

    As far as I can tell, my trends code is working correctly, but the values in RESA2 for these sites must surely be wrong (pH is often < 0.03)? Try to trace this.

  • Sites 37124 and 37129 also have very high EH+/low pH. A quick check suggests that zeros have been entered into RESA instead of NoData in a few cases - this needs correcting.

  • Overall, the entire dataset would benefit from being checked by someone with a better feel for realistic acid chemistry values than me.
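For reference, the link between the suspicious EH+ values and the implied pH is just the usual unit conversion: EH+ here is the H+ concentration in $\mu eq/l$, so $[H^+] = 10^6 \times 10^{-pH}$. The helper name below is mine, not from the trends code:

```python
# Why the EH+ values stand out: a pH near 0 implies an H+ concentration
# near one million ueq/l, which is what the suspect records show
def ph_to_eh(ph):
    """Convert pH to H+ concentration in ueq/l."""
    return 1e6 * 10 ** (-ph)

print(ph_to_eh(7.0))          # a typical Swedish lake: ~0.1 ueq/l
print(round(ph_to_eh(0.03)))  # the suspect records: ~930,000 ueq/l
```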

6.1. Correct sites 37124 and 37129

I've removed the zeros for pH from these datasets, as they've obviously been entered in error.

6.2. Correct Swedish sites

Where have the very low pH values for the 24 Swedish sites come from? The raw data from the Focal Centre for these stations is here:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Tilsendte data fra Focalsentere\Sweden\EKSTRA2015

These files contain sensible values for pH (usually around 7), so what's gone wrong? The filenames ending with the letters “ED” denote spreadsheets created by Tore for uploading into the database. He usually uses the ICPW template for this, but in this case he's done something different. Strangely, pH seems to be missing from the Lakes from 1988 3y sheet of the "ED" files, and something appears to have gone wrong with the upload process. In fact, it looks suspiciously like the values for Pb (in ug/l) have been entered as pH by mistake.

Correcting this is fiddly, because I suspect the problem actually affects a large number of Swedish sites, not just the ones listed above. The difficulty is that these errors aren't obvious in cases where the Pb concentrations are high enough to be sensibly interpreted as pH.

As a start, I've copied the correct pH data from the raw data file over to:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016\Replies\sweden\correct_ph_error.xls

Let's see how this compares to what's in the database.


In [3]:
# Read data from template
raw_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
           r'\Replies\sweden\correct_ph_error.xls')
raw_df = pd.read_excel(raw_xls, sheetname='Data')
raw_df.head()


Out[3]:
Code Name Date pH Pb
0 622410-135589 Fåglasjön 1988-03-15 6.15 NaN
1 622410-135589 Fåglasjön 1988-08-23 6.98 NaN
2 622410-135589 Fåglasjön 1988-11-16 6.55 NaN
3 622410-135589 Fåglasjön 1989-02-28 6.53 NaN
4 622410-135589 Fåglasjön 1989-11-07 6.57 NaN

In [4]:
# Get unique list of site codes
stn_list = raw_df['Code'].unique()

# Get station_ids
sql = ('SELECT * FROM resa2.stations '
       'WHERE station_code IN %s' % str(tuple(stn_list)))
stn_df = pd.read_sql(sql, engine)

# Decode special characters from `windows-1252` encoding to unicode
stn_df['station_name'] = stn_df['station_name'].str.decode('windows-1252')

stn_df.head()


Out[4]:
station_id station_code station_name lake_or_river lake_river_name latitude longitude altitude map_sheet country_no ... catchment_peat_area catchment_water_area station_description entered_by entered_date blr old_id utmn utme zone
0 36611 622410-135589 Fåglasjön L None 55.521853 13.311919 107 None 46 ... 13.938492 19.593254 None TOH 2015-06-01 10:07:22 None None None None None
1 36765 632515-146675 Hjärtsjön L None 57.006072 13.579214 275 None 46 ... 10.140524 19.369540 None TOH 2015-06-01 10:07:27 None None None None None
2 36767 650061-142276 Humsjön L None 56.876783 14.928576 130 None 46 ... 5.251641 22.100656 None TOH 2015-06-01 10:07:27 None None None None None
3 36768 667151-149602 Hällsjön L None 57.050182 15.257347 168 None 46 ... 2.671119 13.355593 None TOH 2015-06-01 10:07:27 None None None None None
4 36769 704955-159090 Hällvattnet L None 57.867520 18.840905 216 None 46 ... 12.405116 8.455629 None TOH 2015-06-01 10:07:27 None None None None None

5 rows × 30 columns


In [14]:
# Get all samples from these sites
sql = ('SELECT * FROM resa2.water_samples '
       'WHERE station_id IN %s' % str(tuple(stn_df['station_id'])))
samp_df = pd.read_sql(sql, engine)

samp_df.head()


Out[14]:
water_sample_id station_id sample_date depth1 depth2 project_id wl_testno wl_serialno wl_marking wl_ident remark entered_by entered_date
0 549209 36561 1988-06-21 0 0 None None None None None None TOH 24.04.2015
1 549210 36561 1988-08-10 0 0 None None None None None None TOH 24.04.2015
2 549211 36561 1988-09-12 0 0 None None None None None None TOH 24.04.2015
3 549212 36561 1989-03-14 0 0 None None None None None None TOH 24.04.2015
4 549213 36561 1989-06-14 0 0 None None None None None None TOH 24.04.2015

In [17]:
# Get all the pH data for these sites
sql = ('SELECT * FROM resa2.water_chemistry_values2 '
       'WHERE sample_id IN (SELECT water_sample_id '
                           'FROM resa2.water_samples '
                           'WHERE station_id IN %s) '
       'AND method_id = 10268' % str(tuple(stn_df['station_id'])))
wc_df = pd.read_sql(sql, engine)

wc_df.head()


Out[17]:
value_id sample_id method_id value flag1 flag2 approved remark entered_by entered_date detection_limit labware_status
0 5418659 549209 10268 6.84 None None YES None TOH 2015-03-25 12:19:23 None None
1 5418660 549210 10268 6.72 None None YES None TOH 2015-03-25 12:19:23 None None
2 5418661 549211 10268 7.19 None None YES None TOH 2015-03-25 12:19:23 None None
3 5418662 549212 10268 7.20 None None YES None TOH 2015-03-25 12:19:24 None None
4 5418663 549213 10268 6.97 None None YES None TOH 2015-03-25 12:19:24 None None

In [46]:
# Join samples
df = pd.merge(wc_df, samp_df, how='left',
              left_on='sample_id', 
              right_on='water_sample_id')

# Join stations
df = pd.merge(df, stn_df, how='left',
              left_on='station_id',
              right_on='station_id')

# Join raw data
df = pd.merge(df, raw_df, how='left',
              left_on=['station_code', 'sample_date'],
              right_on=['Code', 'Date'])

# Extract columns of interest
df = df[['station_id', 'station_code', 'station_name', 'sample_date', 
         'sample_id', 'method_id', 'flag1', 'value', 'pH', 'Pb']]

df.head(10)


Out[46]:
station_id station_code station_name sample_date sample_id method_id flag1 value pH Pb
0 36561 758208-161749 Abiskojaure 1988-06-21 549209 10268 None 6.84 NaN NaN
1 36561 758208-161749 Abiskojaure 1988-08-10 549210 10268 None 6.72 6.72 NaN
2 36561 758208-161749 Abiskojaure 1988-09-12 549211 10268 None 7.19 7.19 NaN
3 36561 758208-161749 Abiskojaure 1989-03-14 549212 10268 None 7.20 7.20 NaN
4 36561 758208-161749 Abiskojaure 1989-06-14 549213 10268 None 6.97 NaN NaN
5 36561 758208-161749 Abiskojaure 1989-08-23 549214 10268 None 7.08 7.08 NaN
6 36561 758208-161749 Abiskojaure 1989-10-11 549215 10268 None 7.12 7.12 NaN
7 36561 758208-161749 Abiskojaure 1990-04-17 549216 10268 None 6.96 6.96 NaN
8 36561 758208-161749 Abiskojaure 1990-08-14 549217 10268 None 6.64 6.64 NaN
9 36561 758208-161749 Abiskojaure 1990-09-11 549218 10268 None 7.03 7.03 NaN

The value column in the above table shows pH measurements according to RESA2. The pH and Pb columns show the values in the raw data. It is clear that the database values agree with those in the raw data in many places (i.e. the database is correct), which only makes this issue stranger: I would expect all of this data to be uploaded in a single go, so it's surprising that the results are correct for some sites but not for others. I'm not sure what's going on here!

Nevertheless, there are 310 occasions where values for Pb have been entered as pH by mistake.


In [47]:
# Identify records where the RESA2 value disagrees with the raw pH data
df2 = df.query('value != pH').dropna(subset=['pH'])

print 'Number of occasions where Pb entered instead of pH:', len(df2)
df2.head()


Number of occasions where Pb entered instead of pH: 310
Out[47]:
station_id station_code station_name sample_date sample_id method_id flag1 value pH Pb
1220 36575 662682-132860 Örvattnet 1998-10-27 594988 10268 None 0.23 5.41 0.23
1221 36575 662682-132860 Örvattnet 1999-11-02 594990 10268 None 0.25 5.28 0.25
1222 36575 662682-132860 Örvattnet 2001-10-17 594996 10268 None 0.47 5.43 0.47
1223 36575 662682-132860 Örvattnet 2000-10-18 594993 10268 None 0.36 5.33 0.36
1224 36575 662682-132860 Örvattnet 2002-10-22 594999 10268 None 0.36 5.25 0.36

The next step is to loop over these 310 records, replacing the numbers in the value column with those in the pH column. Note that a few of the Pb entries were associated with "less than" flags, so these need clearing as well.


In [55]:
df2.query('flag1 == "<"')


Out[55]:
station_id station_code station_name sample_date sample_id method_id flag1 value pH Pb
5655 36797 731799-151196 Stor-Tjulträsket 2006-10-11 597377 10268 < 0.02 7.32 <0,02
5656 36797 731799-151196 Stor-Tjulträsket 2007-10-04 597380 10268 < 0.02 7.44 <0,02
5657 36797 731799-151196 Stor-Tjulträsket 2010-09-25 597389 10268 < 0.02 7.31 <0,02

In [61]:
# Open a database connection for the updates (the head only creates 'engine')
conn = engine.connect()

# Loop over rows
for index, row in df2.iterrows():
    # Get data
    samp_id = row['sample_id']
    ph = row['pH']
    
    # Update chem table
    sql = ("UPDATE resa2.water_chemistry_values2 "
           "SET value = %s, "
               "flag1 = NULL "
           "WHERE sample_id = %s "
           "AND method_id = 10268"
           % (ph, samp_id))

    result = conn.execute(sql)

In [62]:
# Check changes have taken effect
sql = ('SELECT * FROM resa2.water_chemistry_values2 '
       'WHERE sample_id = 597377 '
       'AND method_id = 10268')

df3 = pd.read_sql(sql, engine)

df3


Out[62]:
value_id sample_id method_id value flag1 flag2 approved remark entered_by entered_date detection_limit labware_status
0 6124953 597377 10268 7.32 None None YES None JES 2016-10-19 17:36:55 None None

Finally, re-run the trends code to make sure the EH+ values are now more sensible.


In [68]:
# Run analysis 
res_df, dup_df, nd_df = resa2_trends.run_trend_analysis(proj_list, engine,
                                                        st_yr=None, end_yr=None,
                                                        plot=False, fold=None)

# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']

res_df.sort_values(by='mean', ascending=False).head()


Extracting data from RESA2...
    The database contains duplicate values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    Some stations have no relevant data in the period specified. Their IDs are returned in a separate dataframe.

    Done.

Converting units and applying sea-salt correction...
    Done.

Calculating statistics...
    Data series for Al at site 101 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 102 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 103 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 104 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 107 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 109 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 112 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 115 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 118 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 119 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 120 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 121 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 122 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 123 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 128 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 132 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 134 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 135 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 144 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 146 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 147 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 150 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 156 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 158 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 161 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 162 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 163 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 166 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 168 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 170 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 173 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 176 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 179 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 180 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 181 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 182 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 183 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 185 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 192 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 193 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 196 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 12081 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 23468 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 23546 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 36547 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 36560 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36733 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36739 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36750 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36753 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36793 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for EH at site 36797 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Data series for Al at site 37063 has fewer than 10 non-null values. Significance estimates may be unreliable.
    Done.

Finished.
Out[68]:
station_id par_id period non_missing mean median std_dev mk_stat norm_mk_stat mk_p_val trend sen_slp
0 37834 Al 1989-2015 27 1006.166667 960.000000 223.482532 -207.0 -4.296331 0.000017 decreasing -20.000000
7 36848 ESO4_ECl 1990-2012 23 737.357660 719.821429 94.117998 -89.0 -2.324120 0.020119 decreasing -6.019918
7 36734 ESO4_ECl 1988-2013 26 714.903226 717.179886 143.143990 23.0 0.484914 0.627737 no trend 2.861632
0 37830 Al 1984-2015 31 704.047419 670.000002 184.115978 -183.0 -3.093348 0.001979 decreasing -12.420000
0 36567 Al 1988-2013 25 631.800000 585.000000 245.445471 -190.0 -4.414089 0.000010 decreasing -24.722222

Good. (Previously, the mean and median columns included values close to $1 \times 10^6$ for EH, corresponding to pH values of approximately 0. In the newly cleaned data, the largest value in the mean column is 1006, this time associated with Al. I don't know enough about Al to say whether this is reasonable, but it's at least a step in the right direction.)