In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')
Heleen would like updated results for the ICPW trends analysis at the beginning of October. Today is the last day of September and I'm committed to other projects for the next few weeks, so time is getting tight! The aim of this notebook is to get some more-or-less complete results that Heleen can work with while I'm away next week.
Note: There are still a number of significant issues with the database that I am unlikely to have time to fix in the near future. Although the ICPW dataset itself is pretty simple, cleaning up the existing system is not a small job, due to a long history of database structural development and, in particular, the way that RESA2 is coupled to various other systems. I'm not going to attempt to deal with these issues here, but I'll try to describe them below so I can come back in the future. The aim here is to try to produce something useful despite the outstanding known (and unknown?) issues.
Most of the tasks discussed with John, Don and Heleen back in May have been completed, as described here, here and here. Section 5 of the latter notebook provides a more detailed summary.
On 25/08/2016, John sent a couple of very helpful e-mails in response to my earlier questions regarding the US sites. Unfortunately, comparing John's spreadsheets to the information in RESA2 highlighted a number of further database-wide errors (not specific to the US sites), which are going to be tricky to fix. A notebook describing this work in detail is here, but the main implications are summarised in an e-mail to Heleen (sent 02/09/2016 at 11:42), which is reproduced in part below.
"...it is not possible to reliably extract complete water chemistry time series from RESA2, even for just a single chemical parameter at a single site. This is because, if you choose a site that appears in both the core ICPW programme and in the wider trends analysis (which is most of them), you will find two time series in the database: one associated with 'ICPWaters' and another linked to 'TOC_TRENDS'. These two series will likely have different start and end points and different periods of missing data. There will also be a lot of overlap, and most of the overlapping values will agree, but in a few places they will be substantially different. What's more, at present the database will often report significantly different site properties for the two time series (i.e. different station names, geographic co-ordinates and catchment properties), despite the fact that the samples actually all come from a single location. All this means that if you want to look at a complete water chemistry time series for any site in ICP Waters, the only way to do it at the moment is to extract both series, manually merge them (e.g. in Excel) and then try to figure out which set of site properties is correct.
[...]
At present it is possible for someone using RESA2 to extract two different time series and two sets of site properties for a single location. This is pretty confusing (as we found out back in May), and it also somewhat defeats the point of having a database."
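As a reference for the manual merge described in the e-mail, the same operation can be approximated in pandas. The series below are hypothetical, and `combine_first` is just one possible precedence rule (it keeps the first series wherever both have values):

```python
import pandas as pd

# Hypothetical example: the same site has one series under 'ICPWaters'
# and another under 'TOC_TRENDS', with different gaps and one
# conflicting overlap
icpw = pd.Series({'1990': 5.2, '1991': 5.3, '1993': 5.6})
trend = pd.Series({'1991': 5.3, '1992': 5.4, '1993': 5.5})

# Keep the ICPWaters value wherever both series report one, and fill
# the remaining gaps from the trends series
merged = icpw.combine_first(trend)

# Flag conflicting overlaps for manual review
both = icpw.index.intersection(trend.index)
conflicts = [yr for yr in both if icpw[yr] != trend[yr]]
```

This automates the mechanical part of the Excel merge, but it does not resolve which of the conflicting values (or which set of site properties) is actually correct.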
Other outstanding issues include the large number of Swedish sites with no data, problems with the Czech data (in an e-mail received 08/09/2016, Vladimir suggested just deleting and reloading everything) and the poor availability of site metadata for many locations.
Having discussed these issues with Heleen, I've decided to try the following:
Tidy up the US sites based on information provided by John. I'm not going to attempt to merge any datasets at this stage, but I can clean up the site names and add the original USEPA site codes to reduce confusion.
Look at the Czech data sent by Vladimir and decide whether it can and should replace the values currently in the database.
Run the trends analysis using whatever data is currently associated with the `TOC_TRENDS` projects. In principle, these datasets were gathered together separately from ICPW for the 2015 reanalysis, which is why they're associated with separate projects in the database. As described in the notebook linked above, the values in these series sometimes do not agree with those reported for the core ICPW projects, but I'm going to ignore this for now, as reconciling the differences will take a long time.
In [2]:
# Import custom functions and connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')
resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

engine, conn = resa2_basic.connect_to_resa2()
Based on the results in the previous notebook, I'd like to go through all the US sites in RESA2 and modify the site properties to match the values in John's spreadsheet (which I'm taking to be definitive). In particular, I want to correct the station names and geographic co-ordinates, as well as appending the original USEPA site codes to the station metadata.
We'll start by correcting the core ICPW sites. John's spreadsheet identified 95 sites that should be associated with this project and, following the work in the previous notebook, this now agrees with what's in RESA2. Four of the core ICPW sites in John's spreadsheet are marked `NA` in the `station_code` column. These sites do now have codes in RESA2, so I've created a new version of John's spreadsheet (U.S.Site.Reconciliation.August.2016_jes.xlsx) with these codes added. The first step is therefore to check that there is a direct match between the codes for the 95 core sites in RESA2 and the 95 sites listed in John's spreadsheet.
In [3]:
# Read John's spreadsheet (note: both parts of the path are raw strings,
# otherwise backslash escapes can corrupt it)
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
          r'\Replies\usa\U.S.Site.Reconciliation.August.2016_jes.xlsx')
j_df = pd.read_excel(in_xls, sheetname='Sheet1')

# Get a subset of columns from John's spreadsheet
us_df = j_df[['Station Code', 'Station name', 'Suggested Name',
              'Active', 'INCLUDE IN ICP WATERS DATABASE',
              'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',
              'NFC_SITENAME', 'Latitude', 'Longitude', 'Altitude (m)']]

# Rename columns
us_df.columns = ['station_code', 'station_name', 'suggested_name',
                 'active', 'core', 'trend', 'nfc_code', 'nfc_name',
                 'latitude', 'longitude', 'elevation']

# Just the core sites
core_df = us_df.query('core == "YES"')

core_df.head()
Out[3]:
In [37]:
# Get station codes associated with 'ICP Waters US' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 1679)')
r2_df = pd.read_sql_query(sql, engine)

r2_df.head()
Out[37]:
In [30]:
# Join
df = r2_df.merge(core_df, on='station_code', how='outer',
                 suffixes=('_r', '_j'))

if len(df) == 95:
    print 'All rows match.'

df.head()
Out[30]:
The code above shows that all the sites have been matched correctly, so I can now loop over each site in the `ICP Waters US` project, updating the details in the stations table using the information in John's spreadsheet. I also want to add the original National Focal Centre site code as an additional attribute (as John suggested in his e-mails). The easiest way to do this is to create a new entry (`NFC_Code`) in the `STATION_PARAMETER_DEFINITIONS` table.
In [48]:
# Loop over sites
for row in df.iterrows():
    # Get site properties
    stn_id = row[1]['station_id']
    name = row[1]['suggested_name']
    lat = row[1]['latitude_j']
    lon = row[1]['longitude_j']
    elev = row[1]['elevation']
    nfc = row[1]['nfc_code']

    # Update stations table
    sql = ("UPDATE resa2.stations "
           "SET station_name = '%s', "
           "latitude = %s, "
           "longitude = %s, "
           "altitude = %s "
           "WHERE station_id = %s"
           % (name, lat, lon, elev, stn_id))
    result = conn.execute(sql)

    # Update stations_par_values table with NFC code
    sql = ("INSERT INTO resa2.stations_par_values "
           "(station_id, var_id, value, entered_by, entered_date) "
           "VALUES (%s, 321, '%s', 'JES', TO_DATE('2016-09-30', 'YYYY-MM-DD'))"
           % (stn_id, nfc))
    result = conn.execute(sql)
The next step is to correct the entries for the trends sites. Note that if the database were properly normalised this step wouldn't be necessary: the 76 trends sites are a subset of the 95 core ICPW sites, so the steps described above would already cover all relevant sites. However, due to the duplication of sites in the database, the cleaning has to be done twice.
Matching the trends sites turns out to be a bit more difficult, because of inconsistencies in the `X15:` prefixes. The first step is to manually add Newbert Pond, Maine (`station_code = US82`) to the `ICPW_TOCTRENDS_2015_US_LTM` project, as mentioned at the end of the previous notebook. This should mean that we have 76 sites associated with the trends project, as specified in John's spreadsheet.
In [4]:
# Just the trend sites
trend_df = us_df.query('trend == "YES"')
trend_df.head()
Out[4]:
In [3]:
# Get station codes associated with 'ICPW_TOCTRENDS_2015_US_LTM' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 3870)')
r2_df = pd.read_sql_query(sql, engine)

print 'Number of sites:', len(r2_df)
r2_df.head()
Out[3]:
The next question is how well we can match these sites to John's spreadsheet. Based on the analysis in the previous notebook, the answer is, "not very well", because leading and trailing zeros in the original site codes were truncated (presumably accidentally, e.g. in Excel) before the `X15:` prefix was added. This isn't necessarily a major problem - the station codes used in RESA2 are mostly irrelevant - but I do need to somehow match them to John's spreadsheet and then update the station properties (preferably also adding the original site codes supplied by John, so we don't have to go through all this again later). The code below takes a slightly "heuristic" approach to finding matches.
In [121]:
# Loop over rows from John
for row in trend_df.iterrows():
    # Get site properties
    nfc_cd = row[1]['nfc_code']
    name = row[1]['suggested_name']
    lat = row[1]['latitude']
    lon = row[1]['longitude']
    elev = row[1]['elevation']

    # Attempt to find match. Need to add 'X15:' and allow for variants
    # with a leading or trailing character dropped
    q_res = r2_df[(r2_df['station_code']=='X15:%s' % nfc_cd) |
                  (r2_df['station_code']=='X15:%s' % nfc_cd[1:]) |
                  (r2_df['station_code']=='X15:%s' % nfc_cd[:-1])]

    if len(q_res) == 1:
        # Single match found. Get stn_id
        stn_id = q_res.iloc[0]['station_id']

        # Update stations table
        sql = ("UPDATE resa2.stations "
               "SET station_name = '%s', "
               "latitude = %s, "
               "longitude = %s, "
               "altitude = %s "
               "WHERE station_id = %s"
               % (name, lat, lon, elev, stn_id))
        result = conn.execute(sql)

        # Check whether there's already an entry for this site
        # in stations_par_values table
        sql = ('SELECT * FROM resa2.stations_par_values '
               'WHERE station_id = %s '
               'AND var_id = 321'
               % stn_id)
        df = pd.read_sql_query(sql, engine)

        if len(df) < 1:
            # Update stations_par_values table with NFC code
            sql = ("INSERT INTO resa2.stations_par_values "
                   "(station_id, var_id, value, entered_by, entered_date) "
                   "VALUES (%s, 321, '%s', 'JES', TO_DATE('2016-09-30', 'YYYY-MM-DD'))"
                   % (stn_id, nfc_cd))
            result = conn.execute(sql)

    else:
        # Can't get good match
        print "Can't match %s." % nfc_cd
This code manages to find unique matches for all but two of the sites, which is a good start. Looking at the site codes for the two exceptions in John's spreadsheet, it seems as though they were previously only associated with the core ICPW project and not the broader trends analysis. They were therefore not duplicated when the `TOC Trends` projects were created and instead only appear in the database once, using station codes `US74` and `US82`, respectively (rather than any of the `X15:` variants).
With a bit of luck, I've finally managed to sort out the basic details for the US sites. Let's check.
In [4]:
# Get the NFC site codes
sql = ('SELECT station_id, value AS nfc_code '
       'FROM resa2.stations_par_values '
       'WHERE var_id = 321')
nfc_df = pd.read_sql_query(sql, engine)

# Get station codes associated with 'ICP Waters US' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude, altitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 1679)')
core_df = pd.read_sql_query(sql, engine)

# Get station codes associated with 'ICPW_TOCTRENDS_2015_US_LTM' project
sql = ('SELECT station_id, station_code, station_name, latitude, longitude, altitude '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 3870)')
trend_df = pd.read_sql_query(sql, engine)

# Join in original site codes
core_df = core_df.merge(nfc_df, on='station_id', how='left')
trend_df = trend_df.merge(nfc_df, on='station_id', how='left')

print 'Sites in core ICPW project:', len(core_df)
print 'Sites in trends project: ', len(trend_df)
Regardless of the station codes and IDs in RESA2, I should now be able to make sense of the US data using the actual USEPA codes. For example, the sites in `trend_df` should be a true subset of those in `core_df`, and all the site properties should agree.
In [9]:
# Inner join dfs
df = trend_df.merge(core_df, on='nfc_code', how='inner',
                    suffixes=('_t', '_c'))

# Testing
assert len(df) == 76, 'Incorrect number of sites.'

for item in ['station_name', 'latitude', 'longitude', 'altitude']:
    assert (df[item + '_t'] == df[item + '_c']).all(), 'Mismatch in %ss.' % item

print 'Check complete. All properties match.'
Great - I think the US LTM sites are now more-or-less sorted. Only another 20 countries to go! Note that, with the changes made above, it is now possible to extract the NFC station codes from RESA2 in the same way as for the other station properties (i.e. using the `Additional station data` tab). There are still some issues to be aware of, though. In particular, if you choose to export station properties from RESA2 to Excel, the RESA2 code will convert any all-numeric NFC codes to numbers, so NFC codes such as `013425` are truncated to `13425`. This is not a problem with the database (the values in Oracle are correct) - it is a problem with the RESA2 code that copies results from the database into Excel. I'm not going to delve into this at the moment, as I'm keen to avoid becoming too involved with the RESA2 application itself. As a workaround, it is safer to export from RESA2 as CSV and then import the CSV into Excel, taking care to set the column type for the `nfc_code` field to `Text` during the import process.
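The same pitfall applies if the CSV is read with pandas rather than Excel. A small illustration (the codes here are hypothetical; the key point is forcing the column to be read as text):

```python
import io
import pandas as pd

# Hypothetical CSV export from RESA2. Read naively, '013425' would lose
# its leading zero and '1E105' could be parsed as scientific notation.
csv_data = io.StringIO(u"station_code,nfc_code\nUS26,013425\nUS74,1E105")

# Forcing a string dtype for nfc_code preserves the codes verbatim
df = pd.read_csv(csv_data, dtype={'nfc_code': str})
```

This is the pandas equivalent of setting the column type to `Text` during the Excel import wizard.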
Vladimir has suggested deleting all the Czech data and then uploading it again from scratch. I'm reluctant to delete more than a decade's worth of data, but an alternative option, more consistent with what's been done before, would be to rename the existing Czech site codes and shift them into an `EXCLUDED` project. I can then upload the new Czech data using the same site codes as previously, which will hopefully avoid the issues created when the trend data was similarly uploaded using modified (i.e. `X15:`-prefixed) site codes.
Firstly, some points to note:
The "old" database has Czech data for 9 sites. Two of these, Lysina (`CZ07`) and Pluhuv Bor (`CZ09`), have higher resolution data than the rest. See Section 3 of this notebook for full details.
In his latest data submission, Vladimir has only supplied monthly resolution data for sites `CZ01` to `CZ08` inclusive (i.e. excluding Pluhuv Bor). This is exactly what's required for the core ICPW dataset, but we may wish to include some of the old (weekly) data from Lysina and Pluhuv Bor in the trends analysis (assuming it's considered good enough to use).
For the moment, I propose shifting all of the existing Czech data (sites `CZ01` to `CZ09`) into a new project called `ICPWaters CZ Excl`. I will then upload the new data for sites `CZ01` to `CZ08` and associate these records with both the core ICPW project and the trend analysis work. Both projects will therefore access exactly the same (monthly resolution) data, which is more consistent than the data structure used at present. The downsides are that the trends project will then make use of lower resolution (monthly rather than weekly) data for Lysina (`CZ07`) and have no data at all for Pluhuv Bor (`CZ09`). Given that we are estimating trends from annual averages anyway, I don't think the differences in temporal resolution are a problem (in fact, using monthly data is arguably better, as it's more consistent with our statistical analyses elsewhere). It is also possible to include Pluhuv Bor in the trends project if we want to, but Jakub has previously stated that it is a very well-buffered catchment that does not suffer from acidification (see e-mail received 29/06/2016 at 10:58), so I'm not sure it's appropriate anyway? Check with Heleen.
The first step is to create a new project called `ICPWaters CZ Excl`. I can then rename all the existing Czech site codes (`CZ01` etc.) by adding the suffix `_Old`, followed by shifting them over to the newly created project. The second step is to create 8 new sites (with identical properties to the old ones), assigning them the same site codes as used previously (`CZ01` etc.). These new sites have no data associated with them, so I should be able to upload the revised data from Vladimir without creating any conflicts (I hope). All of this is most easily done manually using Access.
Note: This process involves creating duplicate sites and is therefore superficially similar to the duplication described above, which has already caused lots of confusion. However, these problems have primarily been caused because we have two sets of sites (core and trends), which represent the same locations but which often have different site properties. More importantly, both of these projects are considered "active", and data is periodically appended to update them. This is what causes the problems: we have two supposedly identical datasets evolving in parallel, and over time differences emerge that are difficult to correct.
For the Czech sites, I'm moving all the old data into an `EXCLUDED` project, marking the site names as `OLD`, and also adding a description to the `STATIONS` table saying they are `Associated with data supplied prior to August 2016`. In principle, I'd be happy to delete all of this data entirely (which would remove any concerns about duplication and database normalisation etc.), but I don't want to lose any information if I can avoid it. Besides, deleting records from RESA2 is not straightforward, due to all the other database interactions (a topic I'd like to avoid getting involved in for the moment). My hope is that the changes I'm making here will not cause further confusion, because the sites with the suffix `_Old` will be discontinued completely, i.e. no new data will be associated with them and they won't be used in subsequent projects. This should avoid the messy situation that we're now in with sites from other countries.
With the new project structure in place, I can use upload_icpw_template.ipynb to load the latest data from Vladimir into the database. This seems to have worked successfully, but for some reason the available parameters for the new sites are not showing up properly in the RESA2 application. They are there, and the values appear to be correct, but the `Refresh parameter list` button is not working as it used to. However, if the `STANDARD` button is pressed to select the routine parameters of interest, everything works as expected and the data can be exported to Excel as usual. The bit that's missing is that the user is no longer presented with the option of selecting custom parameters. I'm not sure what's happening here - another of the mysteries of RESA2! I'll have to ask Tore where his code pulls this information from and modify the database accordingly, but I'm not going to worry about this for now, as all my code will interact directly with the underlying Oracle database itself.
To check that my changes have worked as expected, I want to compare the data now in the database with what's in Vladimir's spreadsheet. To do this, I've manually extracted time series for the 8 Czech sites from RESA2 and saved them in check_czech.xlsx. The code below plots these values from RESA2 against the raw values in Vladimir's spreadsheet, which hopefully should agree.
In [29]:
# Read data (note: both parts of the path are raw strings, otherwise
# backslash escapes can corrupt it)
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
           r'\Replies\czech_republic\check_czech.xlsx')
r2_df = pd.read_excel(in_xlsx, sheetname='RESA')
vl_df = pd.read_excel(in_xlsx, sheetname='Orig')
In [35]:
# Define params of interest
params = ['pH', 'Ca', 'Mg', 'Na', 'K', 'Cl', 'SO4', 'ALK', 'TOC']

fig, axes = plt.subplots(nrows=9, ncols=8, figsize=(15, 20))

# Loop over data
for row, param in enumerate(params):
    # Get data
    df1 = r2_df[['Code', 'Date', param]]
    df2 = vl_df[['Code', 'Date', param]]

    # Pivot
    df1 = df1.pivot(index='Date', columns='Code', values=param)
    df2 = df2.pivot(index='Date', columns='Code', values=param)

    # Join
    df = df1.merge(df2, how='outer',
                   left_index=True, right_index=True)

    for site in range(1, 9):
        # Get data for this site (re-assign rather than using
        # inplace=True on a slice, to avoid SettingWithCopyWarning)
        s_df = df[['CZ%02d_RESA' % site, 'CZ%02d_Orig' % site]]
        s_df = s_df.dropna(how='any')

        # Plot
        s_df.plot(ax=axes[row, site-1], legend=False)
        axes[row, site-1].set_title('CZ%02d %s' % (site, param))

plt.tight_layout()
plt.show()
This looks promising: only one colour is visible on these plots, even though two lines are being plotted, so I assume the results from RESA overlap exactly with those from Vladimir's spreadsheet.
The final step for this notebook is to update my trends code and re-run the analysis. Note that there are still some known issues with the data, so these results should not be taken as definitive, but hopefully they are a step in the right direction.
My previous trends notebook left a number of loose ends, which need tidying up here:
My current code averages samples taken from 0 and 0.5 m depth. In his e-mail from 21/07/2016 at 14:54, Don recommended only using samples from 0.5 m. Changing this will require some careful coding, as many of the samples seem to have `depth = 0` as a kind of default, so if I select only those records where `depth = 0.5`, I'll end up missing a large chunk of the data. My algorithm needs to first select all the surface water data, and then choose the 0.5 m sample only in cases where there are duplicates (i.e. results for both 0.5 m and 0.0 m).
My original code used reference ratios for sea-salt corrections which I back-calculated from RESA2. In an e-mail received 16/08/2016 at 14:04 Heleen provided the true values used for the original analysis. I think these agree with my back-calculated values, but this needs checking.
My code does not currently calculate trends for the correct parameters. Based on Heleen's e-mail (see above), the main quantities of interest are:
Deal with duplicates. The previous notebook highlighted some other duplicates for which I couldn't find an explanation (see section 3.2.3). Suzanne hasn't responded to my e-mails, so I'm not in a position to correct these issues at this stage. For now, I propose to take the most recently uploaded of the duplicate values, as these are usually the ones for which I can find a "paper-trail".
Choose a method for dealing with values at the detection limit. There are no hard-and-fast rules here and the choice is unlikely to dramatically influence trend results calculated using non-parametric statistics (i.e. Sen's slope calculated from annual medians). At present, my code simply substitutes the detection limit value, which is the easiest and most transparent approach. I can change this if there are strong arguments for using an alternative method, though.
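The core of the non-parametric approach mentioned above (Sen's slope estimated from annual medians) can be sketched as follows. This is an illustration on hypothetical values, not the actual `toc_trends_analysis.py` implementation:

```python
import numpy as np

def sens_slope(years, values):
    """Theil-Sen estimate of trend: the median of all pairwise slopes."""
    slopes = []
    for i in range(len(years)):
        for j in range(i + 1, len(years)):
            if years[j] != years[i]:
                slopes.append((values[j] - values[i]) /
                              float(years[j] - years[i]))
    return np.median(slopes)

# Hypothetical annual medians for TOC at one site
years = [1990, 1991, 1992, 1993, 1994]
toc = [4.0, 4.2, 4.1, 4.5, 4.6]

slope = sens_slope(years, toc)
```

Because the estimate is a median of slopes, occasional detection-limit substitutions or single odd values have limited influence on the result, which is part of the argument for the simple substitution approach described above.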
Some more careful checking of the database shows that this is actually a very minor problem. Records with measurements from both 0 m and 0.5 m only occur for a few of the Norwegian sites back in the 1970s. For these measurements, the chemistry values are the same regardless of which depth is chosen, so the problem is actually negligible.
The code below finds all of the duplicated depths associated with the ICPW trends sites. The issue is so limited that I don't think it needs further consideration.
In [5]:
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

sql = ('SELECT station_id, station_code '
       'FROM resa2.stations '
       'WHERE station_id IN (SELECT UNIQUE(station_id) '
       'FROM resa2.projects_stations '
       'WHERE project_id IN (SELECT project_id '
       'FROM resa2.projects '
       'WHERE project_name IN %s))'
       % str(tuple(proj_list)))
stn_df = pd.read_sql(sql, engine)

sql = ("SELECT water_sample_id, station_id, sample_date "
       "FROM resa2.water_samples "
       "WHERE station_id IN %s "
       "AND depth1 <= 1 "
       "AND depth2 <= 1" % str(tuple(stn_df['station_id'].values)))
df = pd.read_sql(sql, engine)

print 'Number of duplicated records:', df.duplicated(subset=['station_id', 'sample_date']).sum()
df[df.duplicated(subset=['station_id', 'sample_date'], keep=False)]
Out[5]:
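The selection rule described above (prefer the 0.5 m sample only where both depths exist) can be sketched with pandas. The rows below are hypothetical; the real data comes from `resa2.water_samples`:

```python
import pandas as pd

# Hypothetical surface samples: one site/date has both a 0.0 m and a
# 0.5 m measurement; the others have a single sample
df = pd.DataFrame({'station_id': [1, 1, 1, 2],
                   'sample_date': ['1975-05-01', '1975-05-01',
                                   '1975-06-01', '1975-05-01'],
                   'depth1': [0.0, 0.5, 0.0, 0.0],
                   'value': [5.1, 5.1, 5.3, 6.0]})

# Sort so the deepest (0.5 m) sample comes last within each group,
# then keep the last row per (station, date). Where only one depth
# exists, that row is kept unchanged.
dedup = (df.sort_values(['station_id', 'sample_date', 'depth1'])
           .drop_duplicates(subset=['station_id', 'sample_date'],
                            keep='last'))
```

Since the chemistry values for the Norwegian duplicates are identical at both depths, the choice of rule barely matters here, but this keeps the behaviour explicit.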
These have been checked and the values in my code agree with those in Heleen's e-mail.
I have modified the code to include ANC, calculated as
$$ANC = Ca + Mg + K + Na + NH_4 - Cl - SO_4 - NO_3$$

where all values are in $\mu eq/l$ (as per Heleen's e-mail - see above). However, note that not all sites have complete data for all these chemical parameters, e.g. the Newfoundland sites do not report NH4. I believe NH4 is usually a fairly small component of ANC, so in my code I've decided to assume $NH_4 = 0$ unless it's explicitly specified. This means that I can still calculate ANC for the Newfoundland sites (which would otherwise all return "NoData"). Is this reasonable? Are there any other chemical species that can safely be ignored when they're not reported? Check this with Heleen.
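The calculation can be sketched as below. The function name is an assumption for illustration, and it assumes all concentrations have already been converted to $\mu eq/l$:

```python
def calc_anc(ca, mg, k, na, cl, so4, no3, nh4=None):
    """ANC in ueq/l: base cations minus strong acid anions.
    NH4 defaults to zero when not reported (e.g. the Newfoundland
    sites), rather than returning NoData."""
    if nh4 is None:
        nh4 = 0.0
    return ca + mg + k + na + nh4 - cl - so4 - no3

# Hypothetical concentrations, already in ueq/l
anc = calc_anc(ca=120.0, mg=40.0, k=8.0, na=60.0,
               cl=50.0, so4=80.0, no3=10.0)
```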
Including alkalinity is not straightforward, as there are many different methods and some of them are not fully described in the database. To make use of the alkalinity data I need to convert everything to a common scale and units (preferably in $\mu eq/l$). This can be done for some of the methods using the metadata in RESA2, but for others this is not possible. Having queried this with Heleen, she has suggested ignoring alkalinity for now (see e-mail received 17/10/2016 at 13:19), but this is something to return to later, possibly with help from Øyvind Garmo regarding how to align the various methods. For future reference, the methods in question are listed here:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Data_Summaries\alkalin_method_issues.xlsx
Where duplicates are reported which cannot be explained by different sampling depths (e.g. 0 m and 0.5 m, as above), my code now selects the value that has been added to the database most recently. There is relatively little justification for this decision: I haven't been able to get to the bottom of these issues and this step is really just a fudge (my code previously averaged these measurements, which is also dodgy). The main reason for choosing the most recent database addition is that I can generally find these values in the raw data, whereas the older measurements aren't so obvious. However, it's possible that this is more a reflection of my (still limited) knowledge of the data on the NIVA network, rather than because the more recent data is "more correct".
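A minimal sketch of this "keep the most recent upload" rule (hypothetical rows; my real code works against the RESA2 chemistry tables, where `entered_date` records when each value was added):

```python
import pandas as pd

# Hypothetical duplicated values: station 1 has the same sample/parameter
# uploaded twice on different dates
df = pd.DataFrame({'station_id': [1, 1, 2],
                   'sample_date': ['2000-01-01', '2000-01-01', '2000-01-01'],
                   'param': ['TOC', 'TOC', 'TOC'],
                   'value': [4.1, 4.3, 5.0],
                   'entered_date': ['2005-03-01', '2012-06-01', '2005-03-01']})

# Sort by upload date, then keep the most recently entered row for each
# station/sample/parameter combination
keep = (df.sort_values('entered_date')
          .drop_duplicates(subset=['station_id', 'sample_date', 'param'],
                           keep='last'))
```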
Heleen is happy to replace values below the detection limit with the detection limit itself, which is what my code currently does. See e-mail received 07/10/2016 at 15:18.
This section tests the new code by applying it to the data for all years (i.e. the full dataset for each site).
In [64]:
# Import code for trends analysis
resa2_trends_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
                     r'\Python\icpw\toc_trends_analysis.py')
resa2_trends = imp.load_source('toc_trends_analysis', resa2_trends_path)
The list of projects to consider has been previously agreed with Heleen (see section 3 of this notebook).
In [65]:
# User input
# Specify projects of interest
proj_list = ['ICPW_TOCTRENDS_2015_CA_ATL',
             'ICPW_TOCTRENDS_2015_CA_DO',
             'ICPW_TOCTRENDS_2015_CA_ICPW',
             'ICPW_TOCTRENDS_2015_CA_NF',
             'ICPW_TOCTRENDS_2015_CA_QU',
             'ICPW_TOCTRENDS_2015_CZ',
             'ICPW_TOCTRENDS_2015_Cz2',
             'ICPW_TOCTRENDS_2015_FI',
             'ICPW_TOCTRENDS_2015_NO',
             'ICPW_TOCTRENDS_2015_SE',
             'ICPW_TOCTRENDS_2015_UK',
             'ICPW_TOCTRENDS_2015_US_LTM',
             'ICPWaters Ca']

# Output paths
plot_fold = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
             r'\Results\Trends_Plots')
res_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Results\res.csv')
dup_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Results\dup.csv')
nd_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Results\nd.csv')
In [5]:
# Run analysis
res_df, dup_df, nd_df = resa2_trends.run_trend_analysis(proj_list, engine,
                                                        st_yr=None, end_yr=None,
                                                        plot=True, fold=plot_fold)

# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']

# Write output
res_df.to_csv(res_csv, index=False)
dup_df.to_csv(dup_csv, index=False)
nd_df.to_csv(nd_csv, index=False)

res_df.head(14).sort_values(by='par_id')
Out[5]:
The code seems to be working correctly, but I still have concerns about some of the values in RESA. Looking at the output (`res.csv`), the following issues are fairly obvious:
A number of the Swedish stations have ridiculously high EH+ (i.e. low pH). They can be identified with the following SQL:
SELECT * FROM RESA2.STATIONS
WHERE STATION_ID IN (36797, 36733, 36753, 36793, 36588, 36660, 36578,
36750, 36670, 36825, 36584, 36680, 36788, 36575,
36813, 36636, 36690, 36592, 36675, 36826, 36711,
36723, 36731, 36739);
As far as I can tell, my trends code is working correctly, but the values in RESA2 for these sites must surely be wrong (pH is often < 0.03)? Try to trace this.
Sites 37124 and 37129 also have very high EH+/low pH. A quick check suggests that zeros have been entered into RESA instead of NoData in a few cases - this needs correcting.
Overall, the entire dataset would benefit from being checked by someone with a better feel for realistic acid chemistry values than me.
### 6.1. Correct sites 37124 and 37129
I've removed the zeros for pH from these datasets, as they've obviously been entered in error.
### 6.2. Correct Swedish sites
Where have the very low pH values for the 24 Swedish sites come from? The raw data from the Focal Centre for these stations is here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Tilsendte data fra Focalsentere\Sweden\EKSTRA2015
These files contain sensible values for pH (usually around 7), so what's gone wrong? The filenames ending with the letters "ED" denote spreadsheets created by Tore for uploading into the database. He usually uses the ICPW template for this, but in this case he's done something different. Strangely, pH seems to be missing from the `Lakesfrom 1988 3y` sheet of the "ED" files, and it looks as though something has gone wrong with the upload process. In fact, it looks suspiciously like the values for Pb (in ug/l) have been entered as pH by mistake.
Correcting this is fiddly, because I suspect the problem actually affects a large number of Swedish sites, not just the ones listed above. The difficulty is that these errors aren't obvious in cases where the Pb concentrations are high enough to be sensibly interpreted as pH.
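One possible screening check for the less obvious cases (a sketch on hypothetical values, not the actual correction code): flag rows where the database "pH" matches the raw Pb value but differs from the raw pH:

```python
import pandas as pd

# Hypothetical joined table: 'value' is what RESA2 holds as pH,
# 'pH' and 'Pb' are the corresponding columns from the raw data file
df = pd.DataFrame({'value': [6.8, 12.0, 4.2],
                   'pH': [6.8, 7.1, 6.5],
                   'Pb': [0.5, 12.0, 4.2]})

# Suspected swaps: the database value equals the raw Pb concentration
# but not the raw pH
tol = 1e-6
swapped = df[(abs(df['value'] - df['Pb']) < tol) &
             (abs(df['value'] - df['pH']) > tol)]
```

This would still miss cases where a Pb concentration happens to coincide with a plausible pH, so any automated flagging needs a manual follow-up.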
As a start, I've copied the correct pH data from the raw data file over to:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016\Replies\sweden\correct_ph_error.xls
Let's see how this compares to what's in the database.
In [3]:
# Read data from template
raw_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
r'\Replies\sweden\correct_ph_error.xls')
raw_df = pd.read_excel(raw_xls, sheetname='Data')
raw_df.head()
Out[3]:
In [4]:
# Get unique list of site codes
stn_list = raw_df['Code'].unique()
# Get station_ids
sql = ('SELECT * FROM resa2.stations '
'WHERE station_code IN %s' % str(tuple(stn_list)))
stn_df = pd.read_sql(sql, engine)
# Decode special characters from `windows-1252` encoding to unicode
stn_df['station_name'] = stn_df['station_name'].str.decode('windows-1252')
stn_df.head()
Out[4]:
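As an aside, building the `IN` clause with `str(tuple(stn_list))` works while the list has two or more entries, but for a single station it yields `('X',)`, whose trailing comma is invalid SQL. A hedged sketch of a safer helper (my own, not part of the notebook):

```python
def sql_in_list(values):
    """Build a SQL IN (...) list, quoting strings and avoiding the
    trailing comma that str(tuple(...)) leaves for a single-element
    sequence."""
    items = []
    for v in values:
        if isinstance(v, str):
            items.append("'%s'" % v.replace("'", "''"))  # escape embedded quotes
        else:
            items.append(str(v))
    return '(%s)' % ', '.join(items)

# sql_in_list(['ABC'])   -> "('ABC')"   (valid, unlike "('ABC',)")
# sql_in_list([1, 2, 3]) -> "(1, 2, 3)"
```

(Bind parameters via SQLAlchemy would be better still, but this keeps the notebook's string-building style.)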
In [14]:
# Get all samples from these sites
sql = ('SELECT * FROM resa2.water_samples '
'WHERE station_id IN %s' % str(tuple(stn_df['station_id'])))
samp_df = pd.read_sql(sql, engine)
samp_df.head()
Out[14]:
In [17]:
# Get all the pH data for these sites
sql = ('SELECT * FROM resa2.water_chemistry_values2 '
'WHERE sample_id IN (SELECT water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id IN %s) '
'AND method_id = 10268' % str(tuple(stn_df['station_id'])))
wc_df = pd.read_sql(sql, engine)
wc_df.head()
Out[17]:
In [46]:
# Join samples
df = pd.merge(wc_df, samp_df, how='left',
left_on='sample_id',
right_on='water_sample_id')
# Join stations
df = pd.merge(df, stn_df, how='left',
left_on='station_id',
right_on='station_id')
# Join raw data
df = pd.merge(df, raw_df, how='left',
left_on=['station_code', 'sample_date'],
right_on=['Code', 'Date'])
# Extract columns of interest
df = df[['station_id', 'station_code', 'station_name', 'sample_date',
'sample_id', 'method_id', 'flag1', 'value', 'pH', 'Pb']]
df.head(10)
Out[46]:
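Because three successive left joins like this can silently drop or duplicate rows, it can be worth checking join coverage with pandas' merge `indicator` (a toy sketch using stand-in frames, not the real RESA2 tables):

```python
import pandas as pd

# Toy stand-ins for wc_df and samp_df
wc = pd.DataFrame({'sample_id': [1, 2, 3], 'value': [7.1, 0.03, 6.8]})
samp = pd.DataFrame({'water_sample_id': [1, 2], 'station_id': [10, 10]})

chk = pd.merge(wc, samp, how='left',
               left_on='sample_id', right_on='water_sample_id',
               indicator=True)

# Rows flagged 'left_only' had no matching sample record
unmatched = chk[chk['_merge'] == 'left_only']
```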
The `value` column in the above table shows pH measurements according to RESA2, while the `pH` and `Pb` columns show the values in the raw data. It is clear that the database values agree with those in the raw data in many places (i.e. the database is correct), which only makes this issue stranger: I would expect all of this data to have been uploaded in a single go, so it's surprising that the results are correct for some sites but not for others. I'm not sure what's going on here!
Nevertheless, there are 310 occasions where values for Pb have been entered as pH by mistake.
In [47]:
df2 = df.query('value != pH').dropna(subset=['pH'])
print 'Number of occasions where Pb entered instead of pH:', len(df2)
df2.head()
Out[47]:
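One caveat with `df.query('value != pH')`: it compares floats exactly, so rows differing only by rounding noise could be flagged. That doesn't seem to have bitten here, but a tolerance-based check is an option (toy sketch, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy example: the first row differs only by float noise
df_chk = pd.DataFrame({'value': [7.10000001, 3.5, 7.2],
                       'pH':    [7.1,        7.0, 7.2]})

naive = df_chk[df_chk['value'] != df_chk['pH']]                # flags 2 rows
mismatch = df_chk[~np.isclose(df_chk['value'], df_chk['pH'])]  # flags 1 row
```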
The next step is to loop over these 310 records, replacing the numbers in the `value` column with those in the `pH` column. Note that a few of the Pb entries were associated with "less than" flags, so these need clearing as well.
In [55]:
df2.query('flag1 == "<"')
Out[55]:
In [61]:
# Loop over rows
for index, row in df2.iterrows():
    # Get data
    samp_id = row['sample_id']
    ph = row['pH']

    # Update chem table
    sql = ("UPDATE resa2.water_chemistry_values2 "
           "SET value = %s, "
           "flag1 = NULL "
           "WHERE sample_id = %s "
           "AND method_id = 10268" % (ph, samp_id))
    result = conn.execute(sql)
In [62]:
# Check changes have taken effect
sql = ('SELECT * FROM resa2.water_chemistry_values2 '
'WHERE sample_id = 597377 '
'AND method_id = 10268')
df3 = pd.read_sql(sql, engine)
df3
Out[62]:
Finally, re-run the trends code to make sure the EH+ values are now more sensible.
In [68]:
# Run analysis
res_df, dup_df, nd_df = resa2_trends.run_trend_analysis(proj_list, engine,
st_yr=None, end_yr=None,
plot=False, fold=None)
# Delete mk_std_dev col as not relevant here
del res_df['mk_std_dev']
res_df.sort_values(by='mean', ascending=False).head()
Out[68]:
Good. (Previously, the `mean` and `median` columns included values close to $1 \times 10^6$ for EH, corresponding to pH values of approximately 0. In the newly cleaned data, the largest value in the `mean` column is 1006, this time associated with Al. I don't know enough about Al to say whether this is reasonable, but it's at least a step in the right direction.)
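As a quick sanity check, the EH+ means can be converted back to pH (a sketch; the function name is mine):

```python
import math

def eh_to_ph(eh_ueq):
    """pH from H+ concentration in ueq/l."""
    return -math.log10(eh_ueq / 1e6)

eh_to_ph(1e6)   # 0.0 -- the pre-fix means, i.e. pH ~ 0 (clearly wrong)
eh_to_ph(10.0)  # about 5.0 -- plausible for an acid-sensitive lake
```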