In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')
Heleen would like updated results for the ICPW trends analysis at the beginning of October. Today is the last day of September and I'm committed to other projects for the next few weeks, so time is getting tight! The aim of this notebook is to get some more-or-less complete results that Heleen can work with while I'm away next week.
Note: There are still a number of significant issues with the database that I am unlikely to have time to fix in the near future. Although the ICPW dataset itself is pretty simple, cleaning up the existing system is not a small job, due to a long history of database structural development and, in particular, the way that RESA2 is coupled to various other systems. I'm not going to attempt to deal with these issues here, but I'll try to describe them below so I can come back in the future. The aim here is to try to produce something useful despite the outstanding known (and unknown?) issues.
Most of the tasks discussed with John, Don and Heleen back in May have been completed, as described here, here and here. Section 5 of the latter notebook provides a more detailed summary.
On 25/08/2016, John sent a couple of very helpful e-mails in response to my earlier questions regarding the US sites. Unfortunately, comparing John's spreadsheets to the information in RESA2 highlighted a number of further database-wide errors (not specific to the US sites), which are going to be tricky to fix. A notebook describing this work in detail is here, but the main implications are summarised in an e-mail to Heleen (sent 02/09/2016 at 11:42), which is reproduced in part below.
"...it is not possible to reliably extract complete water chemistry time series from RESA2, even for just a single chemical parameter at a single site. This is because, if you choose a site that appears in both the core ICPW programme and in the wider trends analysis (which is most of them), you will find two time series in the database: one associated with 'ICPWaters' and another linked to 'TOC_TRENDS'. These two series will likely have different start and end points and different periods of missing data. There will also be a lot of overlap, and most of the overlapping values will agree, but in a few places they will be substantially different. What's more, at present the database will often report significantly different site properties for the two time series (i.e. different station names, geographic co-ordinates and catchment properties), despite the fact that the samples actually all come from a single location. All this means that if you want to look at a complete water chemistry time series for any site in ICP Waters, the only way to do it at the moment is to extract both series, manually merge them (e.g. in Excel) and then try to figure out which set of site properties is correct.
[...]
At present it is possible for someone using RESA2 to extract two different time series and two sets of site properties for a single location. This is pretty confusing (as we found out back in May), and it also somewhat defeats the point of having a database."
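As a reference for the manual merge described in the e-mail, the same operation can be approximated in pandas. The series below are hypothetical, and `combine_first` is just one possible precedence rule (it keeps the first series wherever both have values):

```python
import pandas as pd

# Hypothetical example: the same site has one series under 'ICPWaters'
# and another under 'TOC_TRENDS', with different gaps and one
# conflicting overlap
icpw = pd.Series({'1990': 5.2, '1991': 5.3, '1993': 5.6})
trend = pd.Series({'1991': 5.3, '1992': 5.4, '1993': 5.5})

# Keep the ICPWaters value wherever both series report one, and fill
# the remaining gaps from the trends series
merged = icpw.combine_first(trend)

# Flag conflicting overlaps for manual review
both = icpw.index.intersection(trend.index)
conflicts = [yr for yr in both if icpw[yr] != trend[yr]]
```

This automates the mechanical part of the Excel merge, but it does not resolve which of the conflicting values (or which set of site properties) is actually correct.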
Other outstanding issues include the large number of Swedish sites with no data, problems with the Czech data (in an e-mail received 08/09/2016, Vladimir suggested just deleting and reloading everything) and the poor availability of site metadata for many locations.
Having discussed these issues with Heleen, I've decided to try the following:
Tidy up the US sites based on information provided by John. I'm not going to attempt to merge any datasets at this stage, but I can clean up the site names and add the original USEPA site codes to reduce confusion.
Look at the Czech data sent by Vladimir and decide whether it can and should replace the values currently in the database.
Run the trends analysis using whatever data is currently associated with the `TOC_TRENDS` projects. In principle, these datasets were gathered together separately from ICPW for the 2015 reanalysis, which is why they're associated with separate projects in the database. As described in the notebook linked above, the values in these series sometimes do not agree with those reported for the core ICPW projects, but I'm going to ignore this for now, as reconciling the differences will take a long time.
In [2]:
# Import custom functions and connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')
resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

engine, conn = resa2_basic.connect_to_resa2()
Based on the results in the previous notebook, I'd like to go through all the US sites in RESA2 and modify the site properties to match the values in John's spreadsheet (which I'm taking to be definitive). In particular, I want to correct the station names and geographic co-ordinates, as well as appending the original USEPA site codes to the station metadata.
We'll start by correcting the core ICPW sites. John's spreadsheet identified 95 sites that should be associated with this project and, following the work in the previous notebook, this now agrees with what's in RESA2. Four of the core ICPW sites in John's spreadsheet are marked `NA` in the `station_code` column. These sites do now have codes in RESA2, so I've created a new version of John's spreadsheet (U.S.Site.Reconciliation.August.2016_jes.xlsx) with these codes added. The first step is therefore to check that there is a direct match between the codes for the 95 core sites in RESA2 and the 95 sites listed in John's spreadsheet.
In [3]:
# Read John's spreadsheet (note: both parts of the path are raw strings,
# otherwise backslash escapes can corrupt it)
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
          r'\Replies\usa\U.S.Site.Reconciliation.August.2016_jes.xlsx')
j_df = pd.read_excel(in_xls, sheetname='Sheet1')

# Get a subset of columns from John's spreadsheet
us_df = j_df[['Station Code', 'Station name', 'Suggested Name',
              'Active', 'INCLUDE IN ICP WATERS DATABASE',
              'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',
              'NFC_SITENAME', 'Latitude', 'Longitude', 'Altitude (m)']]

# Rename columns
us_df.columns = ['station_code', 'station_name', 'suggested_name',
                 'active', 'core', 'trend', 'nfc_code', 'nfc_name',
                 'latitude', 'longitude', 'elevation']

# Just the core sites
core_df = us_df.query('core == "YES"')

core_df.head()
Out[3]:
In [37]:
# Get station codes associated with 'ICP Waters US' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 1679)')
r2_df = pd.read_sql_query(sql, engine)

r2_df.head()
Out[37]:
In [30]:
# Join
df = r2_df.merge(core_df, on='station_code', how='outer',
                 suffixes=('_r', '_j'))

if len(df) == 95:
    print 'All rows match.'

df.head()
Out[30]:
The code above shows that all the sites have been matched correctly, so I can now loop over each site in the `ICP Waters US` project, updating the details in the stations table using the information in John's spreadsheet. I also want to add the original National Focal Centre site code as an additional attribute (as John suggested in his e-mails). The easiest way to do this is to create a new entry (`NFC_Code`) in the `STATION_PARAMETER_DEFINITIONS` table.
In [48]:
# Loop over sites
for row in df.iterrows():
    # Get site properties
    stn_id = row[1]['station_id']
    name = row[1]['suggested_name']
    lat = row[1]['latitude_j']
    lon = row[1]['longitude_j']
    elev = row[1]['elevation']
    nfc = row[1]['nfc_code']

    # Update stations table
    sql = ("UPDATE resa2.stations "
           "SET station_name = '%s', "
           "latitude = %s, "
           "longitude = %s, "
           "altitude = %s "
           "WHERE station_id = %s"
           % (name, lat, lon, elev, stn_id))
    result = conn.execute(sql)

    # Update stations_par_values table with NFC code
    sql = ("INSERT INTO resa2.stations_par_values "
           "(station_id, var_id, value, entered_by, entered_date) "
           "VALUES (%s, 321, '%s', 'JES', TO_DATE('2016-09-30', 'YYYY-MM-DD'))"
           % (stn_id, nfc))
    result = conn.execute(sql)
The next step is to correct the entries for the trends sites. Note that if the database were properly normalised this step wouldn't be necessary: the 76 trends sites are a subset of the 95 core ICPW sites, so the steps described above would already cover all relevant sites. However, due to the duplication of sites in the database, the cleaning has to be done twice.
Matching the trends sites turns out to be a bit more difficult, because of inconsistencies in the `X15:` prefixes. The first step is to manually add Newbert Pond, Maine (`station_code = US82`) to the `ICPW_TOCTRENDS_2015_US_LTM` project, as mentioned at the end of the previous notebook. This should mean that we have 76 sites associated with the trends project, as specified in John's spreadsheet.
In [4]:
# Just the trend sites
trend_df = us_df.query('trend == "YES"')
trend_df.head()
Out[4]:
In [3]:
# Get station codes associated with 'ICPW_TOCTRENDS_2015_US_LTM' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 3870)')
r2_df = pd.read_sql_query(sql, engine)

print 'Number of sites:', len(r2_df)
r2_df.head()
Out[3]:
The next question is how well we can match these sites to John's spreadsheet. Based on the analysis in the previous notebook, the answer is, "not very well", because leading and trailing zeros in the original site codes were truncated (presumably accidentally, e.g. in Excel) before the `X15:` prefix was added. This isn't necessarily a major problem - the station codes used in RESA2 are mostly irrelevant - but I do need to somehow match them to John's spreadsheet and then update the station properties (preferably also adding the original site codes supplied by John, so we don't have to go through all this again later). The code below takes a slightly "heuristic" approach to finding matches.
In [121]:
# Loop over rows from John
for row in trend_df.iterrows():
    # Get site properties
    nfc_cd = row[1]['nfc_code']
    name = row[1]['suggested_name']
    lat = row[1]['latitude']
    lon = row[1]['longitude']
    elev = row[1]['elevation']

    # Attempt to find match. Need to add 'X15:' and allow for variants
    # with a leading or trailing character dropped
    q_res = r2_df[(r2_df['station_code']=='X15:%s' % nfc_cd) |
                  (r2_df['station_code']=='X15:%s' % nfc_cd[1:]) |
                  (r2_df['station_code']=='X15:%s' % nfc_cd[:-1])]

    if len(q_res) == 1:
        # Single match found. Get stn_id
        stn_id = q_res.iloc[0]['station_id']

        # Update stations table
        sql = ("UPDATE resa2.stations "
               "SET station_name = '%s', "
               "latitude = %s, "
               "longitude = %s, "
               "altitude = %s "
               "WHERE station_id = %s"
               % (name, lat, lon, elev, stn_id))
        result = conn.execute(sql)

        # Check whether there's already an entry for this site
        # in stations_par_values table
        sql = ('SELECT * FROM resa2.stations_par_values '
               'WHERE station_id = %s '
               'AND var_id = 321'
               % stn_id)
        df = pd.read_sql_query(sql, engine)

        if len(df) < 1:
            # Update stations_par_values table with NFC code
            sql = ("INSERT INTO resa2.stations_par_values "
                   "(station_id, var_id, value, entered_by, entered_date) "
                   "VALUES (%s, 321, '%s', 'JES', TO_DATE('2016-09-30', 'YYYY-MM-DD'))"
                   % (stn_id, nfc_cd))
            result = conn.execute(sql)

    else:
        # Can't get good match
        print "Can't match %s." % nfc_cd
This code manages to find unique matches for all but two of the sites, which is a good start. Looking at the site codes for the two exceptions in John's spreadsheet, it seems as though they were previously only associated with the core ICPW project and not the broader trends analysis. They were therefore not duplicated when the `TOC Trends` projects were created and instead only appear in the database once, using station codes `US74` and `US82`, respectively (rather than any of the `X15:` variants).
With a bit of luck, I've finally managed to sort out the basic details for the US sites. Let's check.
In [4]:
# Get the NFC site codes
sql = ('SELECT station_id, value AS nfc_code '
       'FROM resa2.stations_par_values '
       'WHERE var_id = 321')
nfc_df = pd.read_sql_query(sql, engine)

# Get station codes associated with 'ICP Waters US' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude, altitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 1679)')
core_df = pd.read_sql_query(sql, engine)

# Get station codes associated with 'ICPW_TOCTRENDS_2015_US_LTM' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude, altitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 3870)')
trend_df = pd.read_sql_query(sql, engine)

# Join in original site codes
core_df = core_df.merge(nfc_df, on='station_id', how='left')
trend_df = trend_df.merge(nfc_df, on='station_id', how='left')

print 'Sites in core ICPW project:', len(core_df)
print 'Sites in trends project: ', len(trend_df)
Regardless of the station codes and IDs in RESA2, I should now be able to make sense of the US data using the actual USEPA codes. For example, the sites in `trend_df` should be a true subset of those in `core_df`, and all the site properties should agree.
In [9]:
# Inner join dfs
df = trend_df.merge(core_df, on='nfc_code', how='inner',
                    suffixes=('_t', '_c'))

# Testing
assert len(df) == 76, 'Incorrect number of sites.'

for item in ['station_name', 'latitude', 'longitude', 'altitude']:
    assert (df[item + '_t'] == df[item + '_c']).all(), 'Mismatch in %ss.' % item

print 'Check complete. All properties match.'
Great - I think the US LTM sites are now more-or-less sorted. Only another 20 countries to go! Note that, with the changes made above, it is now possible to extract the NFC station codes from RESA2 in the same way as for the other station properties (i.e. using the `Additional station data` tab). There are still some issues to be aware of, though. In particular, if you choose to export station properties from RESA2 to Excel, the RESA2 code will convert any all-numeric NFC codes to numbers, so NFC codes such as `013425` are truncated to `13425`. This is not a problem with the database (the values in Oracle are correct) - it is a problem with the RESA2 code that copies results from the database into Excel. I'm not going to delve into this at the moment, as I'm keen to avoid becoming too involved with the RESA2 application itself. As a workaround, it is safer to export from RESA2 as CSV and then import the CSV into Excel, taking care to set the column type for the `nfc_code` field to `Text` during the import process.
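The same pitfall applies if the CSV is read with pandas rather than Excel. A small illustration (the codes here are hypothetical; the key point is forcing the column to be read as text):

```python
import io
import pandas as pd

# Hypothetical CSV export from RESA2. Read naively, '013425' would lose
# its leading zero and '1E105' could be parsed as scientific notation.
csv_data = io.StringIO(u"station_code,nfc_code\nUS26,013425\nUS74,1E105")

# Forcing a string dtype for nfc_code preserves the codes verbatim
df = pd.read_csv(csv_data, dtype={'nfc_code': str})
```

This is the pandas equivalent of setting the column type to `Text` during the Excel import wizard.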
Vladimir has suggested deleting all the Czech data and then uploading it again from scratch. I'm reluctant to delete more than a decade's worth of data, but an alternative option, more consistent with what's been done before, would be to rename the existing Czech site codes and shift them into an `EXCLUDED` project. I can then upload the new Czech data using the same site codes as previously, which will hopefully avoid the issues created when the trend data was similarly uploaded using modified (i.e. `X15:`-prefixed) site codes.
Firstly, some points to note:
The "old" database has Czech data for 9 sites. Two of these, Lysina (`CZ07`) and Pluhuv Bor (`CZ09`), have higher resolution data than the rest. See Section 3 of this notebook for full details.
In his latest data submission, Vladimir has only supplied monthly resolution data for sites `CZ01` to `CZ08` inclusive (i.e. excluding Pluhuv Bor). This is exactly what's required for the core ICPW dataset, but we may wish to include some of the old (weekly) data from Lysina and Pluhuv Bor in the trends analysis (assuming it's considered good enough to use).
For the moment, I propose shifting all of the existing Czech data (sites `CZ01` to `CZ09`) into a new project called `ICPWaters CZ Excl`. I will then upload the new data for sites `CZ01` to `CZ08` and associate these records with both the core ICPW project and the trend analysis work. Both projects will therefore access exactly the same (monthly resolution) data, which is more consistent than the data structure used at present. The downsides are that the trends project will then make use of lower resolution (monthly rather than weekly) data for Lysina (`CZ07`) and have no data at all for Pluhuv Bor (`CZ09`). Given that we are estimating trends from annual averages anyway, I don't think the differences in temporal resolution are a problem (in fact, using monthly data is arguably better, as it's more consistent with our statistical analyses elsewhere). It is also possible to include Pluhuv Bor in the trends project if we want to, but Jakub has previously stated that it is a very well-buffered catchment that does not suffer from acidification (see e-mail received 29/06/2016 at 10:58), so I'm not sure it's appropriate anyway? Check with Heleen.
The first step is to create a new project called `ICPWaters CZ Excl`. I can then rename all the existing Czech site codes (`CZ01` etc.) by adding the suffix `_Old`, followed by shifting them over to the newly created project. The second step is to create 8 new sites (with identical properties to the old ones), assigning them the same site codes as used previously (`CZ01` etc.). These new sites have no data associated with them, so I should be able to upload the revised data from Vladimir without creating any conflicts (I hope). All of this is most easily done manually using Access.
Note: This process involves creating duplicate sites and is therefore superficially similar to the duplication described above, which has already caused lots of confusion. However, these problems have primarily been caused because we have two sets of sites (core and trends), which represent the same locations but which often have different site properties. More importantly, both of these projects are considered "active", and data is periodically appended to update them. This is what causes the problems: we have two supposedly identical datasets evolving in parallel, and over time differences emerge that are difficult to correct.
For the Czech sites, I'm moving all the old data into an `EXCLUDED` project, marking the site names as `OLD`, and also adding a description to the `STATIONS` table saying they are `Associated with data supplied prior to August 2016`. In principle, I'd be happy to delete all of this data entirely (which would remove any concerns about duplication and database normalisation etc.), but I don't want to lose any information if I can avoid it. Besides, deleting records from RESA2 is not straightforward, due to all the other database interactions (a topic I'd like to avoid getting involved in for the moment). My hope is that the changes I'm making here will not cause further confusion, because the sites with the suffix `_Old` will be discontinued completely, i.e. no new data will be associated with them and they won't be used in subsequent projects. This should avoid the messy situation that we're now in with sites from other countries.
With the new project structure in place, I can use upload_icpw_template.ipynb to load the latest data from Vladimir into the database. This seems to have worked successfully, but for some reason the available parameters for the new sites are not showing up properly in the RESA2 application. They are there, and the values appear to be correct, but the `Refresh parameter list` button is not working as it used to. However, if the `STANDARD` button is pressed to select the routine parameters of interest, everything works as expected and the data can be exported to Excel as usual. The bit that's missing is that the user is no longer presented with the option of selecting custom parameters. I'm not sure what's happening here - another of the mysteries of RESA2! I'll have to ask Tore where his code pulls this information from and modify the database accordingly, but I'm not going to worry about this for now, as all my code will interact directly with the underlying Oracle database itself.
To check that my changes have worked as expected, I want to compare the data now in the database with what's in Vladimir's spreadsheet. To do this, I've manually extracted time series for the 8 Czech sites from RESA2 and saved them in check_czech.xlsx. The code below plots these values from RESA2 against the raw values in Vladimir's spreadsheet, which hopefully should agree.
In [29]:
# Read data (note: both parts of the path are raw strings, otherwise
# backslash escapes can corrupt it)
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
           r'\Replies\czech_republic\check_czech.xlsx')
r2_df = pd.read_excel(in_xlsx, sheetname='RESA')
vl_df = pd.read_excel(in_xlsx, sheetname='Orig')
In [35]:
# Define params of interest
params = ['pH', 'Ca', 'Mg', 'Na', 'K', 'Cl', 'SO4', 'ALK', 'TOC']

fig, axes = plt.subplots(nrows=9, ncols=8, figsize=(15, 20))

# Loop over data
for row, param in enumerate(params):
    # Get data
    df1 = r2_df[['Code', 'Date', param]]
    df2 = vl_df[['Code', 'Date', param]]

    # Pivot
    df1 = df1.pivot(index='Date', columns='Code', values=param)
    df2 = df2.pivot(index='Date', columns='Code', values=param)

    # Join
    df = df1.merge(df2, how='outer',
                   left_index=True, right_index=True)

    for site in range(1, 9):
        # Get data for this site (re-assign rather than using
        # inplace=True on a slice, to avoid SettingWithCopyWarning)
        s_df = df[['CZ%02d_RESA' % site, 'CZ%02d_Orig' % site]]
        s_df = s_df.dropna(how='any')

        # Plot
        s_df.plot(ax=axes[row, site-1], legend=False)
        axes[row, site-1].set_title('CZ%02d %s' % (site, param))

plt.tight_layout()
plt.show()
This looks promising: only one colour is visible on these plots, even though two lines are being plotted, so I assume the results from RESA overlap exactly with those from Vladimir's spreadsheet.
The final step for this notebook is to update my trends code and re-run the analysis. Note that there are still some known issues with the data, so these results should not be taken as definitive, but hopefully they are a step in the right direction.
My previous trends notebook left a number of loose ends, which need tidying up here:
My current code averages samples taken from 0 and 0.5 m depth. In his e-mail from 21/07/2016 at 14:54, Don recommended only using samples from 0.5 m. Changing this will require some careful coding, as many of the samples seem to have `depth = 0` as a kind of default, so if I select only those records where `depth = 0.5`, I'll end up missing a large chunk of the data. My algorithm needs to first select all the surface water data, and then choose the 0.5 m sample only in cases where there are duplicates (i.e. results for both 0.5 m and 0.0 m).
My original code used reference ratios for sea-salt corrections which I back-calculated from RESA2. In an e-mail received 16/08/2016 at 14:04 Heleen provided the true values used for the original analysis. I think these agree with my back-calculated values, but this needs checking.
My code does not currently calculate trends for the correct parameters. Based on Heleen's e-mail (see above), the main quantities of interest are:
Deal with duplicates. The previous notebook highlighted some other duplicates for which I couldn't find an explanation (see section 3.2.3). Suzanne hasn't responded to my e-mails, so I'm not in a position to correct these issues at this stage. For now, I propose to take the most recently uploaded of the duplicate values, as these are usually the ones for which I can find a "paper-trail".
Choose a method for dealing with values at the detection limit. There are no hard-and-fast rules here and the choice is unlikely to dramatically influence trend results calculated using non-parametric statistics (i.e. Sen's slope calculated from annual medians). At present, my code simply substitutes the detection limit value, which is the easiest and most transparent approach. I can change this if there are strong arguments for using an alternative method, though.
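The core of the non-parametric approach mentioned above (Sen's slope estimated from annual medians) can be sketched as follows. This is an illustration on hypothetical values, not the actual `toc_trends_analysis.py` implementation:

```python
import numpy as np

def sens_slope(years, values):
    """Theil-Sen estimate of trend: the median of all pairwise slopes."""
    slopes = []
    for i in range(len(years)):
        for j in range(i + 1, len(years)):
            if years[j] != years[i]:
                slopes.append((values[j] - values[i]) /
                              float(years[j] - years[i]))
    return np.median(slopes)

# Hypothetical annual medians for TOC at one site
years = [1990, 1991, 1992, 1993, 1994]
toc = [4.0, 4.2, 4.1, 4.5, 4.6]

slope = sens_slope(years, toc)
```

Because the estimate is a median of slopes, occasional detection-limit substitutions or single odd values have limited influence on the result, which is part of the argument for the simple substitution approach described above.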
Some more careful checking of the database shows that this is actually a very minor problem. Records with measurements from both 0 m and 0.5 m only occur for a few of the Norwegian sites back in the 1970s. For these measurements, the chemistry values are the same regardless of which depth is chosen, so the problem is actually negligible.
The code below finds all of the duplicated depths associated with the ICPW trends sites. The issue is so limited that I don't think it needs further consideration.
In [5]:
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

sql = ('SELECT station_id, station_code '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT UNIQUE(station_id) '
       'FROM resa2.projects_stations '
       'WHERE project_id IN (SELECT project_id '
       'FROM resa2.projects '
       'WHERE project_name IN %s))'
       % str(tuple(proj_list)))
stn_df = pd.read_sql(sql, engine)

sql = ("SELECT water_sample_id, station_id, sample_date "
       "FROM resa2.water_samples "
       "WHERE station_id IN %s "
       "AND depth1 <= 1 "
       "AND depth2 <= 1" % str(tuple(stn_df['station_id'].values)))
df = pd.read_sql(sql, engine)

print 'Number of duplicated records:', df.duplicated(subset=['station_id', 'sample_date']).sum()
df[df.duplicated(subset=['station_id', 'sample_date'], keep=False)]
Out[5]:
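The selection rule described above (prefer the 0.5 m sample only where both depths exist) can be sketched with pandas. The rows below are hypothetical; the real data comes from `resa2.water_samples`:

```python
import pandas as pd

# Hypothetical surface samples: one site/date has both a 0.0 m and a
# 0.5 m measurement; the others have a single sample
df = pd.DataFrame({'station_id': [1, 1, 1, 2],
                   'sample_date': ['1975-05-01', '1975-05-01',
                                   '1975-06-01', '1975-05-01'],
                   'depth1': [0.0, 0.5, 0.0, 0.0],
                   'value': [5.1, 5.1, 5.3, 6.0]})

# Sort so the deepest (0.5 m) sample comes last within each group,
# then keep the last row per (station, date). Where only one depth
# exists, that row is kept unchanged.
dedup = (df.sort_values(['station_id', 'sample_date', 'depth1'])
           .drop_duplicates(subset=['station_id', 'sample_date'],
                            keep='last'))
```

Since the chemistry values for the Norwegian duplicates are identical at both depths, the choice of rule barely matters here, but this keeps the behaviour explicit.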
These have been checked and the values in my code agree with those in Heleen's e-mail.
I have modified the code to include ANC, calculated as
$$ANC = Ca + Mg + K + Na + NH_4 - Cl - SO_4 - NO_3$$

where all values are in $\mu eq/l$ (as per Heleen's e-mail - see above). However, note that not all sites have complete data for all these chemical parameters, e.g. the Newfoundland sites do not report NH4. I believe NH4 is usually a fairly small component of ANC, so in my code I've decided to assume $NH_4 = 0$ unless it's explicitly specified. This means that I can still calculate ANC for the Newfoundland sites (which would otherwise all return "NoData"). Is this reasonable? Are there any other chemical species that can safely be ignored when they're not reported? Check this with Heleen.
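The calculation can be sketched as below. The function name is an assumption for illustration, and it assumes all concentrations have already been converted to $\mu eq/l$:

```python
def calc_anc(ca, mg, k, na, cl, so4, no3, nh4=None):
    """ANC in ueq/l: base cations minus strong acid anions.
    NH4 defaults to zero when not reported (e.g. the Newfoundland
    sites), rather than returning NoData."""
    if nh4 is None:
        nh4 = 0.0
    return ca + mg + k + na + nh4 - cl - so4 - no3

# Hypothetical concentrations, already in ueq/l
anc = calc_anc(ca=120.0, mg=40.0, k=8.0, na=60.0,
               cl=50.0, so4=80.0, no3=10.0)
```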
Including alkalinity is not straightforward, as there are many different methods and some of them are not fully described in the database. To make use of the alkalinity data I need to convert everything to a common scale and units (preferably in $\mu eq/l$). This can be done for some of the methods using the metadata in RESA2, but for others this is not possible. Having queried this with Heleen, she has suggested ignoring alkalinity for now (see e-mail received 17/10/2016 at 13:19), but this is something to return to later, possibly with help from Øyvind Garmo regarding how to align the various methods. For future reference, the methods in question are listed here:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Data_Summaries\alkalin_method_issues.xlsx
Where duplicates are reported which cannot be explained by different sampling depths (e.g. 0 m and 0.5 m, as above), my code now selects the value that has been added to the database most recently. There is relatively little justification for this decision: I haven't been able to get to the bottom of these issues and this step is really just a fudge (my code previously averaged these measurements, which is also dodgy). The main reason for choosing the most recent database addition is that I can generally find these values in the raw data, whereas the older measurements aren't so obvious. However, it's possible that this is more a reflection of my (still limited) knowledge of the data on the NIVA network, rather than because the more recent data is "more correct".
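A minimal sketch of this "keep the most recent upload" rule (hypothetical rows; my real code works against the RESA2 chemistry tables, where `entered_date` records when each value was added):

```python
import pandas as pd

# Hypothetical duplicated values: station 1 has the same sample/parameter
# uploaded twice on different dates
df = pd.DataFrame({'station_id': [1, 1, 2],
                   'sample_date': ['2000-01-01', '2000-01-01', '2000-01-01'],
                   'param': ['TOC', 'TOC', 'TOC'],
                   'value': [4.1, 4.3, 5.0],
                   'entered_date': ['2005-03-01', '2012-06-01', '2005-03-01']})

# Sort by upload date, then keep the most recently entered row for each
# station/sample/parameter combination
keep = (df.sort_values('entered_date')
          .drop_duplicates(subset=['station_id', 'sample_date', 'param'],
                           keep='last'))
```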
Heleen is happy to replace values below the detection limit with the detection limit itself, which is what my code currently does. See e-mail received 07/10/2016 at 15:18.
This section tests the new code by applying it to the data for all years (i.e. the full dataset for each site).
In [64]:
# Import code for trends analysis
resa2_trends_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
                     r'\Python\icpw\toc_trends_analysis.py')
resa2_trends = imp.load_source('toc_trends_analysis', resa2_trends_path)
The list of projects to consider has been previously agreed with Heleen (see section 3 of this notebook).
In [65]:
# User input
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

# Output paths
plot_fold = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
             r'\Results\Trends_Plots')
res_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Results\res.csv')
dup_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Results\dup.csv')
nd_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\nd.csv')
In [5]:
# Run analysis
res_df, dup_df, nd_df = resa2_trends.run_trend_analysis(proj_list, engine,
                                                        st_yr=None, end_yr=None,
                                                        plot=True, fold=plot_fold)

# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']

# Write output
res_df.to_csv(res_csv, index=False)
dup_df.to_csv(dup_csv, index=False)
nd_df.to_csv(nd_csv, index=False)

res_df.head(14).sort_values(by='par_id')
Out[5]:
The code seems to be working correctly, but I still have concerns about some of the values in RESA. Looking at the output (`res.csv`), the following issues are fairly obvious:
A number of the Swedish stations have ridiculously high EH+ (i.e. low pH). They can be identified with the following SQL:
SELECT * FROM RESA2.STATIONS
WHERE STATION_ID IN (36797, 36733, 36753, 36793, 36588, 36660, 36578,
36750, 36670, 36825, 36584, 36680, 36788, 36575,
36813, 36636, 36690, 36592, 36675, 36826, 36711,
36723, 36731, 36739);
As far as I can tell, my trends code is working correctly, but the values in RESA2 for these sites must surely be wrong (pH is often < 0.03)? Try to trace this.
Sites 37124 and 37129 also have very high EH+/low pH. A quick check suggests that zeros have been entered into RESA instead of NoData in a few cases - this needs correcting.
Overall, the entire dataset would benefit from being checked by someone with a better feel for realistic acid chemistry values than me.
### 6.1. Correct sites 37124 and 37129
I've removed the zeros for pH from these datasets, as they've obviously been entered in error.
### 6.2. Correct Swedish sites
Where have the very low pH values for the 24 Swedish sites come from? The raw data from the Focal Centre for these stations is here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Tilsendte data fra Focalsentere\Sweden\EKSTRA2015
These files contain sensible values for pH (usually around 7), so what's gone wrong? The filenames ending with the letters "ED" denote spreadsheets created by Tore for uploading into the database. He usually uses the ICPW template for this, but in this case he's done something different. Strangely, pH seems to be missing from the `Lakesfrom 1988 3y` sheet of the "ED" files, and it looks as though something has gone wrong with the upload process. In fact, it looks suspiciously like the values for Pb (in ug/l) have been entered as pH by mistake.
Correcting this is fiddly, because I suspect the problem actually affects a large number of Swedish sites, not just the ones listed above. The difficulty is that these errors aren't obvious in cases where the Pb concentrations are high enough to be sensibly interpreted as pH.
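One possible screening check for the less obvious cases (a sketch on hypothetical values, not the actual correction code): flag rows where the database "pH" matches the raw Pb value but differs from the raw pH:

```python
import pandas as pd

# Hypothetical joined table: 'value' is what RESA2 holds as pH,
# 'pH' and 'Pb' are the corresponding columns from the raw data file
df = pd.DataFrame({'value': [6.8, 12.0, 4.2],
                   'pH': [6.8, 7.1, 6.5],
                   'Pb': [0.5, 12.0, 4.2]})

# Suspected swaps: the database value equals the raw Pb concentration
# but not the raw pH
tol = 1e-6
swapped = df[(abs(df['value'] - df['Pb']) < tol) &
             (abs(df['value'] - df['pH']) > tol)]
```

This would still miss cases where a Pb concentration happens to coincide with a plausible pH, so any automated flagging needs a manual follow-up.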
As a start, I've copied the correct pH data from the raw data file over to:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016\Replies\sweden\correct_ph_error.xls
Let's see how this compares to what's in the database.
In [3]:
# Read data from template
raw_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
r'\Replies\sweden\correct_ph_error.xls')
raw_df = pd.read_excel(raw_xls, sheetname='Data')
raw_df.head()
Out[3]:
In [4]:
# Get unique list of site codes
stn_list = raw_df['Code'].unique()
# Get station_ids
sql = ('SELECT * FROM resa2.stations '
'WHERE station_code IN %s' % str(tuple(stn_list)))
stn_df = pd.read_sql(sql, engine)
# Decode special characters from `windows-1252` encoding to unicode
stn_df['station_name'] = stn_df['station_name'].str.decode('windows-1252')
stn_df.head()
Out[4]:
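As an aside, building the `IN` clause with `str(tuple(stn_list))` works while the list has two or more entries, but for a single station it yields `('X',)`, whose trailing comma is invalid SQL. A hedged sketch of a safer helper (my own, not part of the notebook):

```python
def sql_in_list(values):
    """Build a SQL IN (...) list, quoting strings and avoiding the
    trailing comma that str(tuple(...)) leaves for a single-element
    sequence."""
    items = []
    for v in values:
        if isinstance(v, str):
            items.append("'%s'" % v.replace("'", "''"))  # escape embedded quotes
        else:
            items.append(str(v))
    return '(%s)' % ', '.join(items)

# sql_in_list(['ABC'])   -> "('ABC')"   (valid, unlike "('ABC',)")
# sql_in_list([1, 2, 3]) -> "(1, 2, 3)"
```

(Bind parameters via SQLAlchemy would be better still, but this keeps the notebook's string-building style.)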
In [14]:
# Get all samples from these sites
sql = ('SELECT * FROM resa2.water_samples '
'WHERE station_id IN %s' % str(tuple(stn_df['station_id'])))
samp_df = pd.read_sql(sql, engine)
samp_df.head()
Out[14]:
In [17]:
# Get all the pH data for these sites
sql = ('SELECT * FROM resa2.water_chemistry_values2 '
'WHERE sample_id IN (SELECT water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id IN %s) '
'AND method_id = 10268' % str(tuple(stn_df['station_id'])))
wc_df = pd.read_sql(sql, engine)
wc_df.head()
Out[17]:
In [46]:
# Join samples
df = pd.merge(wc_df, samp_df, how='left',
left_on='sample_id',
right_on='water_sample_id')
# Join stations
df = pd.merge(df, stn_df, how='left',
left_on='station_id',
right_on='station_id')
# Join raw data
df = pd.merge(df, raw_df, how='left',
left_on=['station_code', 'sample_date'],
right_on=['Code', 'Date'])
# Extract columns of interest
df = df[['station_id', 'station_code', 'station_name', 'sample_date',
'sample_id', 'method_id', 'flag1', 'value', 'pH', 'Pb']]
df.head(10)
Out[46]:
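Because three successive left joins like this can silently drop or duplicate rows, it can be worth checking join coverage with pandas' merge `indicator` (a toy sketch using stand-in frames, not the real RESA2 tables):

```python
import pandas as pd

# Toy stand-ins for wc_df and samp_df
wc = pd.DataFrame({'sample_id': [1, 2, 3], 'value': [7.1, 0.03, 6.8]})
samp = pd.DataFrame({'water_sample_id': [1, 2], 'station_id': [10, 10]})

chk = pd.merge(wc, samp, how='left',
               left_on='sample_id', right_on='water_sample_id',
               indicator=True)

# Rows flagged 'left_only' had no matching sample record
unmatched = chk[chk['_merge'] == 'left_only']
```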
The `value` column in the above table shows pH measurements according to RESA2, while the `pH` and `Pb` columns show the values in the raw data. It is clear that the database values agree with those in the raw data in many places (i.e. the database is correct), which only makes this issue stranger: I would expect all of this data to have been uploaded in a single go, so it's surprising that the results are correct for some sites but not for others. I'm not sure what's going on here!
Nevertheless, there are 310 occasions where values for Pb have been entered as pH by mistake.
In [47]:
df2 = df.query('value != pH').dropna(subset=['pH'])
print 'Number of occasions where Pb entered instead of pH:', len(df2)
df2.head()
Out[47]:
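One caveat with `df.query('value != pH')`: it compares floats exactly, so rows differing only by rounding noise could be flagged. That doesn't seem to have bitten here, but a tolerance-based check is an option (toy sketch, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy example: the first row differs only by float noise
df_chk = pd.DataFrame({'value': [7.10000001, 3.5, 7.2],
                       'pH':    [7.1,        7.0, 7.2]})

naive = df_chk[df_chk['value'] != df_chk['pH']]                # flags 2 rows
mismatch = df_chk[~np.isclose(df_chk['value'], df_chk['pH'])]  # flags 1 row
```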
The next step is to loop over these 310 records, replacing the numbers in the `value` column with those in the `pH` column. Note that a few of the Pb entries were associated with "less than" flags, so these need clearing as well.
In [55]:
df2.query('flag1 == "<"')
Out[55]:
In [61]:
# Loop over rows
for index, row in df2.iterrows():
    # Get data
    samp_id = row['sample_id']
    ph = row['pH']

    # Update chem table
    sql = ("UPDATE resa2.water_chemistry_values2 "
           "SET value = %s, "
           "flag1 = NULL "
           "WHERE sample_id = %s "
           "AND method_id = 10268" % (ph, samp_id))
    result = conn.execute(sql)
In [62]:
# Check changes have taken effect
sql = ('SELECT * FROM resa2.water_chemistry_values2 '
'WHERE sample_id = 597377 '
'AND method_id = 10268')
df3 = pd.read_sql(sql, engine)
df3
Out[62]:
Finally, re-run the trends code to make sure the EH+ values are now more sensible.
In [68]:
# Run analysis
res_df, dup_df, nd_df = resa2_trends.run_trend_analysis(proj_list, engine,
st_yr=None, end_yr=None,
plot=False, fold=None)
# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']
res_df.sort_values(by='mean', ascending=False).head()
Out[68]:
Good. (Previously, the `mean` and `median` columns included values close to $1 \times 10^6$ for EH, corresponding to pH values of approximately 0. In the newly cleaned data, the largest value in the `mean` column is 1006, this time associated with Al. I don't know enough about Al to say whether this is reasonable, but it's at least a step in the right direction.)
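As a quick sanity check, the EH+ means can be converted back to pH (a sketch; the function name is mine):

```python
import math

def eh_to_ph(eh_ueq):
    """pH from H+ concentration in ueq/l."""
    return -math.log10(eh_ueq / 1e6)

eh_to_ph(1e6)   # 0.0 -- the pre-fix means, i.e. pH ~ 0 (clearly wrong)
eh_to_ph(10.0)  # about 5.0 -- plausible for an acid-sensitive lake
```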