In [29]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')
Back from holiday and trying to remember where I'd got to with the ICPW database. John has sent detailed replies to my questions regarding the US sites (see e-mails received 25/08/2016 at 01:45 and 03:33), so this seems like a good place to start.
My initial clean-up of RESA2 ended with 72 sites linked to the ICPW_TOCTRENDS_2015_US_LTM project (see section 2 of this notebook for details) and, following advice from Tore, I also shifted 19 US sites into a new project called ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED. In addition, manually checking RESA2 identifies 95 sites associated with the ICPWaters US project.
Based on the information in John's emails, some of this needs revisiting.
In [2]:
# Import custom functions and connect to db
# Import custom functions and connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')
resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)
engine, conn = resa2_basic.connect_to_resa2()
In [4]:
# Check numbers of sites
proj_codes = [1679,  # ICPWaters US
              3870,  # ICPW_TOCTRENDS_2015_US_LTM
              4150]  # ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED

for code in proj_codes:
    # Read stations list
    sql = ('SELECT * '
           'FROM resa2.projects_stations '
           'WHERE project_id = %s' % code)
    df = pd.read_sql_query(sql, engine)
    print code, len(df)
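As an aside, the loop above issues one query per project; the same counts could come back in a single grouped query. A sketch, run here against an in-memory SQLite stand-in for `resa2.projects_stations` rather than the real Oracle database (the demo IDs are invented):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for resa2.projects_stations (illustration only)
conn = sqlite3.connect(':memory:')
demo = pd.DataFrame({'project_id': [1679, 1679, 3870, 4150, 4150, 4150],
                     'station_id': [1, 2, 3, 4, 5, 6]})
demo.to_sql('projects_stations', conn, index=False)

# One grouped query instead of a query per project
sql = ('SELECT project_id, COUNT(station_id) AS n_stations '
       'FROM projects_stations '
       'WHERE project_id IN (1679, 3870, 4150) '
       'GROUP BY project_id')
counts = pd.read_sql_query(sql, conn)
print(counts)
```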
Based on John's e-mail, four of the sites that I previously excluded should actually be included in the trends project, so I need to move these back across:
Site ID | Site name |
---|---|
040874O | Brook Trout Lake, Adirondacks |
1A1-109O | Moss Lake, Adirondacks |
1A1-110O | Lake Rondaxe, Adirondacks |
ME-4474 | Duck Pond, Maine |
John has sent a spreadsheet (U.S.Site.Reconciliation.August.2016.xlsx) listing which sites should be included in which projects. First, compare the number of sites in John's spreadsheet with the numbers in RESA2.
In [3]:
# Read John's spreadsheet
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
          r'\Replies\usa\U.S.Site.Reconciliation.August.2016.xlsx')
j_df = pd.read_excel(in_xls, sheetname='Sheet1')
print 'John, core ICPW sites: ', len(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'])
print 'John, DOC trends sites:', len(j_df[j_df['INCLUDE IN ICP DOC ANALYSIS']=='YES'])
This looks promising. With a bit of luck, I can just move the four sites listed above from the EXCLUDED project back into ICPW_TOCTRENDS_2015_US_LTM and everything will be OK...

...actually, no - it's not that simple. Of the 95 ICPWaters US sites in the spreadsheet, John only has RESA2 codes for 91 of them (four are marked NA). This implies the 95 sites currently in RESA2 are not the same as the 95 in the spreadsheet, so I need to work out where the discrepancies are.
In [28]:
# Get the 95 'ICPWaters US' sites from RESA2
sql = ('SELECT * FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
       'FROM resa2.projects_stations '
       'WHERE project_id = 1679)')
core_df = pd.read_sql_query(sql, engine)

# Outer join based on RESA2 station code
df = core_df.merge(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'],
                   how='outer',
                   left_on='station_code',
                   right_on='Station Code')

# Cols of interest
cols = ['station_id', 'station_code', 'station_name', 'latitude', 'longitude',
        'Station Code', 'NFC_SITEID', 'NFC_SITENAME', 'Latitude', 'Longitude']
df = df[cols]

# Get entries where no match was found
df[pd.isnull(df['station_code']) | pd.isnull(df['Station Code'])]
Out[28]:
So, there are four stations (rows 60, 69, 70 and 71 in the table above) currently included in the ICPWaters US project that should be removed. There are also four stations (rows 95 to 98 above) that are present in John's spreadsheet but not associated with the project. Of these, two (Middle Pond and Tunk) are already in the database but associated with the ICPW_TOCTRENDS_2015_US_LTM project only, while one (Sucker) has been shifted to the ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED project. The remaining site (Barnes Lake) does not appear anywhere in the database as far as I can tell.
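Incidentally, pandas can label each row's provenance directly during the merge via `indicator=True`, which avoids the manual `isnull` test. A minimal sketch with made-up station codes (not the real RESA2 data):

```python
import pandas as pd

# Hypothetical station lists (stand-ins for RESA2 and John's spreadsheet)
resa = pd.DataFrame({'station_code': ['US26', 'US74', 'US99']})
john = pd.DataFrame({'Station Code': ['US26', 'US74', 'US101']})

chk = resa.merge(john, how='outer', left_on='station_code',
                 right_on='Station Code', indicator=True)

# '_merge' is 'left_only' (RESA2 only), 'right_only' (spreadsheet only) or 'both'
print(chk[chk['_merge'] != 'both'])
```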
Next steps:

- Remove the four stations identified above (rows 60, 69, 70 and 71) from the ICPWaters US project (and transfer them to EXCLUDED)
- Add Middle Pond, Tunk and Sucker to the ICPWaters US project
- Create Barnes Lake and add it to the ICPWaters US project
- Move Sucker from ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED back to ICPW_TOCTRENDS_2015_US_LTM
These changes are most easily made manually using Access. As a quick check, let's make sure there are now 95 sites associated with ICPWaters US and 76 sites associated with ICPW_TOCTRENDS_2015_US_LTM.
In [31]:
# Check numbers of sites
proj_codes = [1679,  # ICPWaters US
              3870]  # ICPW_TOCTRENDS_2015_US_LTM

for code in proj_codes:
    # Read stations list
    sql = ('SELECT * '
           'FROM resa2.projects_stations '
           'WHERE project_id = %s' % code)
    df = pd.read_sql_query(sql, engine)
    print code, len(df)
I wish this section was as exciting as the title makes it sound!
Section 2 of the previous notebook identified two Mud Ponds. The one with RESA2 station_id=37063 is correct, but the other (station_id=23709) has incorrect co-ordinates and is several hundred kilometres from the true location. The incorrect site has become associated with lots of different projects and this means that deleting it is not straightforward. Let's get a list of projects using each of these sites.
In [10]:
# Find projects using the different versions of Mud Pond
for site_id in [23709, 37063]:
    sql = ('SELECT project_id, project_name '
           'FROM resa2.projects '
           'WHERE project_id IN (SELECT project_id '
           'FROM resa2.projects_stations '
           'WHERE station_id = %s)' % site_id)
    df = pd.read_sql_query(sql, engine)

    print 'Projects using station_id %s:\n' % site_id
    print df
    print '\n'
So, the incorrect site has been used in 16 different projects, most (all?) of which are linked to ICP Waters in some way. The only project using the correct site is ICPW_TOCTRENDS_2015_US_LTM. Before I can delete the incorrect site, I need to switch all these other projects to use station_id=37063 instead.

NB: Based on the above output, ICPW_TOCTRENDS_2015_US_LTM currently uses both Mud Ponds, so when I delete the incorrect one, there will only be 75 sites associated with this project. This means I'm still missing a site somewhere compared to John's spreadsheet. Come back to this.
In Access:

- Delete the row linking station_id=23709 to ICPW_TOCTRENDS_2015_US_LTM in the projects_stations table.
- For the remaining projects, switch the station_id from 23709 to 37063.
- Delete the entries for station 23709 from the stations_par_values table.

The tricky part is how to deal with the water chemistry results associated with each of these sites.
In [28]:
# Read sample info for incorrect site
sql = ('SELECT water_sample_id, station_id, sample_date '
'FROM resa2.water_samples '
'WHERE station_id = 23709')
df_23709 = pd.read_sql_query(sql, engine)
# Read sample info for correct site
sql = ('SELECT water_sample_id, station_id, sample_date '
'FROM resa2.water_samples '
'WHERE station_id = 37063')
df_37063 = pd.read_sql_query(sql, engine)
# Outer join
df = df_23709.merge(df_37063, how='outer',
                    on='sample_date',
                    suffixes=['_23709', '_37063'])
print 'Number of samples associated with site 23709:', len(df_23709)
print 'Number of samples associated with site 37063:', len(df_37063)
print 'Number of samples with common dates: ', len(df.dropna(how='any'))
It looks as though there is at least some overlap here, but it's not easy to see what's going on without some plots. Let's pick a commonly measured parameter ($Cl^-$ in $mg/l$) and plot the values for each site to see whether they're actually the same.
In [52]:
# Site 23709
sql = ('SELECT sample_id, value '
'FROM resa2.water_chemistry_values2 '
'WHERE method_id = 10253 '
'AND sample_id IN (SELECT water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id = 23709)')
df_23709 = pd.read_sql_query(sql, engine)
# Join dates
df_23709 = df_23709.merge(df, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id_23709')

# Tidy df
df_23709.sort_values(by='sample_date', inplace=True, ascending=True)
df_23709.index = df_23709['sample_date']
df_23709 = df_23709[['value']]
df_23709.columns = ['site_23709']
# Site 37063
sql = ('SELECT sample_id, value '
'FROM resa2.water_chemistry_values2 '
'WHERE method_id = 10253 '
'AND sample_id IN (SELECT water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id = 37063)')
df_37063 = pd.read_sql_query(sql, engine)
# Join dates
df_37063 = df_37063.merge(df, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id_37063')

# Tidy df
df_37063.sort_values(by='sample_date', inplace=True, ascending=True)
df_37063.index = df_37063['sample_date']
df_37063 = df_37063[['value']]
df_37063.columns = ['site_37063']
# Merge
merged_df = df_23709.merge(df_37063, how='outer',
                           left_index=True,
                           right_index=True)
print 'Total number of samples in merged series:', len(merged_df)
# Plot
merged_df.plot(figsize=(12, 8))
plt.title('Chloride concentration in mg/l at two Mud Ponds', fontsize=20)
plt.tight_layout()
mpld3.display()
Out[52]:
It's pretty clear that this is the same data series (although something weird happens in 1996 - use the pan and zoom tools to see this). However, there appears to be no consistent pattern regarding which samples have been assigned to which site. The 1996 discrepancy seems pretty minor, so the best way to fix this is probably as follows:

- Identify the water samples associated with site 23709 that have no matching sample date for site 37063.
- Transfer these samples to site 37063 by updating their station_id.

This should create a single series for site 37063 with 132 water samples. Any remaining samples associated with site 23709 can then be deleted (as they're already duplicated for site 37063) and then, with a bit of luck, I should also be able to remove the "bad" Mud Pond from the database without inflicting too much damage elsewhere.
In [60]:
# Identify samples associated ONLY with site 23709
trans_df = df[pd.isnull(df['station_id_37063'])]
# These need transferring to site 37063
sql = ('UPDATE resa2.water_samples '
       'SET station_id = 37063 '
       'WHERE water_sample_id IN %s'
       % str(tuple(trans_df['water_sample_id_23709'].map(int))))
result = conn.execute(sql)
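As an aside, interpolating the ID tuple into the SQL string breaks for a single-element tuple (Python renders `(1,)` with a trailing comma, which is invalid SQL) and bypasses parameter binding. A bound-parameter version, sketched here against an in-memory SQLite copy of the two relevant columns rather than the real `resa2.water_samples` table:

```python
import sqlite3

# Miniature stand-in for resa2.water_samples (invented sample IDs)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE water_samples (water_sample_id INTEGER, station_id INTEGER)')
conn.executemany('INSERT INTO water_samples VALUES (?, ?)',
                 [(1, 23709), (2, 23709), (3, 37063)])

# IDs to transfer (would come from trans_df['water_sample_id_23709'])
ids = [1, 2]
placeholders = ', '.join(['?'] * len(ids))
conn.execute('UPDATE water_samples SET station_id = 37063 '
             'WHERE water_sample_id IN (%s)' % placeholders, ids)

n = conn.execute('SELECT COUNT(*) FROM water_samples '
                 'WHERE station_id = 37063').fetchone()[0]
print(n)  # all three samples now point at 37063
```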
Now plot the results for chloride again to make sure it's worked.
In [3]:
# Site 23709
# Get dates
sql = ('SELECT water_sample_id, sample_date '
'FROM resa2.water_samples '
'WHERE station_id = 23709')
dates_23709 = pd.read_sql_query(sql, engine)
sql = ('SELECT sample_id, value '
'FROM resa2.water_chemistry_values2 '
'WHERE method_id = 10253 '
'AND sample_id IN (SELECT water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id = 23709)')
df_23709 = pd.read_sql_query(sql, engine)
# Join dates
df_23709 = df_23709.merge(dates_23709, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id')

# Tidy df
df_23709.sort_values(by='sample_date', inplace=True, ascending=True)
df_23709.index = df_23709['sample_date']
df_23709 = df_23709[['value']]
df_23709.columns = ['site_23709']
# Site 37063
# Get dates
sql = ('SELECT water_sample_id, sample_date '
'FROM resa2.water_samples '
'WHERE station_id = 37063')
dates_37063 = pd.read_sql_query(sql, engine)
sql = ('SELECT sample_id, value '
'FROM resa2.water_chemistry_values2 '
'WHERE method_id = 10253 '
'AND sample_id IN (SELECT water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id = 37063)')
df_37063 = pd.read_sql_query(sql, engine)
# Join dates
df_37063 = df_37063.merge(dates_37063, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id')

# Tidy df
df_37063.sort_values(by='sample_date', inplace=True, ascending=True)
df_37063.index = df_37063['sample_date']
df_37063 = df_37063[['value']]
df_37063.columns = ['site_37063']
# Merge
merged_df = df_23709.merge(df_37063, how='outer',
                           left_index=True,
                           right_index=True)
# Plot
merged_df.plot(subplots=True, sharey=True, figsize=(12, 8))
plt.tight_layout()
mpld3.display()
Out[3]:
This looks OK, so I can now delete all the remaining records associated with station_id=23709...

...Actually, I can't delete these samples. They're protected by the database constraint SS_WSID_FK, which links water_samples to sample_selections. I'm not sure what the sample_selections table is, and I'm wary of deleting records from tables I don't understand in case it breaks some other component of the wider LabWare-RESA2-NivaBase system.

For now, I'll leave site 23709 in the database, along with the remaining samples that are associated with it. All I've actually done here is transfer 25 water samples from the early record of station 23709 to station 37063. It won't be difficult to undo these changes if necessary, but it would be good to tidy this up properly. Come back to this and find out what the sample_selections table actually does.
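The failure mode here (a child table blocking deletes from the parent) can be reproduced in miniature with stdlib SQLite. A sketch, with table names borrowed from the notebook but an invented column layout - the real RESA2 schema is Oracle and certainly differs:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite enforces FKs only when asked
conn.execute('CREATE TABLE water_samples (water_sample_id INTEGER PRIMARY KEY)')
conn.execute('CREATE TABLE sample_selections ('
             'sel_id INTEGER PRIMARY KEY, '
             'water_sample_id INTEGER REFERENCES water_samples (water_sample_id))')
conn.execute('INSERT INTO water_samples VALUES (1)')
conn.execute('INSERT INTO sample_selections VALUES (10, 1)')

# Deleting the parent row fails while the child row still references it
blocked = False
try:
    conn.execute('DELETE FROM water_samples WHERE water_sample_id = 1')
except sqlite3.IntegrityError:
    blocked = True
print('Delete blocked:', blocked)
```

The fix in the real database would be to delete (or re-point) the child rows in sample_selections first - which is exactly the step I'm not yet confident is safe.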
It seems the Mystery of Mud Pond runs a bit deeper than originally thought. Many of the ICPW sites in RESA2 are duplicated, with the following as a typical example:
station_id | station_code | station_name | lat | long |
---|---|---|---|---|
23661 | US26 | Biscuit Brook, Catskills | 41.9869 | -74.5031 |
37014 | X15:1434025 | BISCUIT BROOK | 42.0111 | -74.4147 |
The first entry (station_id 23661) is associated with the "core" ICP Waters programme, whereas the second (station_id 37014) was added later as part of the trend analysis project. Despite the slight differences in co-ordinates, these are actually the same site.
Adding stations twice is a little dangerous in a relational database, as it breaks the principle of database normalisation. Nevertheless, as Tore points out in his e-mail of 01/09/2016 at 08:51, it is sometimes justified as the most convenient thing to do. There are some potential problems with this, though. For example, when John and Don were here in May, they extracted site data for both the ICPWaters US and ICPW_TOCTRENDS_2015_US_LTM projects and were understandably confused by the different site codes and co-ordinates. More importantly, when sites are duplicated, it's very easy to forget to update both versions when new data comes in. The result, as illustrated by Mud Pond above, is that we end up with two patchy data records, neither of which tells the whole story by itself. This wouldn't be a problem if the ICPWaters XX projects (and the associated sites e.g. US26 in the table above) were "archived" to represent the state of the data at the time of some previous analysis. Any more recent data would then be added to a new project. However, as far as I can tell, this isn't the case: each year new data from the Focal Centres are added to the ICPWaters XX sites, while ICPW_TOCTRENDS_2015_XX is more of a "static" dataset, based on a bulk data upload at a single point in time.
For the US sites, John has produced a spreadsheet that attempts to sort all this out, by linking the RESA2 codes used by ICPWaters to USEPA codes and corrected site properties. The main question now is how to use this information.
Given that the ICPWaters XX projects do not represent an accurate archive of the data at some previous time point (because more data has been added subsequently), and considering that many of the latitudes and longitudes for these stations are now known to be incorrect, there doesn't seem much justification for keeping the duplicate sites. However, because the water samples associated with the individual sites in each pair are different (see the Mud Pond example above), it's not necessarily straightforward to merge the duplicated datasets. A simpler option would be to keep the sites separate but to correct/update the site properties (names, co-ordinates etc.) based on John's spreadsheet, so at least the duplicated sites are consistently duplicated. This would require less effort, but will likely lead to further confusion in the future, and it also means the updated trends analysis will not make use of the full dataset.
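If the duplicated records are eventually merged, pandas `combine_first` captures the basic pattern: take one site's series and fill its gaps from the other. A sketch with invented dates and values:

```python
import pandas as pd

dates = pd.to_datetime(['1990-06-01', '1991-06-01', '1992-06-01', '1993-06-01'])
old_site = pd.Series([1.2, None, 1.4, None], index=dates)   # e.g. the US26 record
new_site = pd.Series([None, 1.3, 1.35, 1.5], index=dates)   # e.g. the X15: record

# Prefer the 'old' record, filling its gaps from the 'new' one
merged = old_site.combine_first(new_site)
print(merged)
```

Note that where both series hold a (conflicting) value, `combine_first` silently keeps the first series, so conflicting duplicates like the 1996 Mud Pond samples would still need manual checking.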
Rather than worrying about which projects have which sites, a useful preliminary step will be to simply check which of the sites in John's spreadsheet are actually in the database, and whether the properties are correct. Even this isn't entirely straightforward, but I've had a go using the code below.
In [4]:
# Read stations table
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
'FROM resa2.stations')
stn_df = pd.read_sql_query(sql, engine)
stn_df.head()
Out[4]:
In [14]:
# Get a subset of columns from John's spreadsheet
us_df = j_df[['Station Code', 'Station name', 'Suggested Name',
              'Active', 'INCLUDE IN ICP WATERS DATABASE',
              'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',
              'NFC_SITENAME', 'Latitude', 'Longitude']]

# Rename columns
us_df.columns = ['station_code', 'station_name', 'suggested_name',
                 'active', 'core', 'trend', 'nfc_code', 'nfc_name',
                 'latitude', 'longitude']
us_df.head()
Out[14]:
In [37]:
# Dict to store results
res_dict = {'stn_code':[],
            'code_present':[],
            'name_ok':[],
            'lat_ok':[],
            'lon_ok':[]}

# Iterate over spreadsheet
for idx in range(len(us_df)):
    # Get stn id
    stn_cd = us_df.ix[idx, 'station_code']
    name = us_df.ix[idx, 'suggested_name'] # These are John's ideal names
    lat = us_df.ix[idx, 'latitude']
    lon = us_df.ix[idx, 'longitude']

    # Check stations table
    q_res = stn_df.query('station_code == @stn_cd')

    if len(q_res) == 0:
        # The site isn't present with this code
        res_dict['stn_code'].append(stn_cd)
        res_dict['code_present'].append(0)
        res_dict['name_ok'].append(np.nan)
        res_dict['lat_ok'].append(np.nan)
        res_dict['lon_ok'].append(np.nan)
    elif len(q_res) == 1:
        # Check the site properties
        res_dict['stn_code'].append(stn_cd)
        res_dict['code_present'].append(1)
        res_dict['name_ok'].append(name == q_res.station_name.values[0])
        res_dict['lat_ok'].append(lat == q_res.latitude.values[0])
        res_dict['lon_ok'].append(lon == q_res.longitude.values[0])
    else:
        # This should be impossible
        raise ValueError('Site code %s appears to be duplicated.' % stn_cd)

# Build df
res_df = pd.DataFrame(res_dict)
res_df = res_df[['stn_code', 'code_present',
                 'name_ok', 'lat_ok', 'lon_ok']]
res_df.head()
Out[37]:
Note the Unicode warning here, highlighting that some matches on the name column have failed due to issues with special text characters, so the results are probably a worst-case scenario. Nevertheless, some of the sites match exactly. Let's see how many matches and partial matches we have.
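One way to make the name comparison more tolerant of special characters is to normalise both strings before comparing (NFKD-decompose, drop the combining accents, lower-case). A sketch - `norm_name` is my own helper and the example names are invented:

```python
import unicodedata

def norm_name(s):
    """Strip accents, case and surrounding whitespace for a more
    forgiving name comparison."""
    s = unicodedata.normalize('NFKD', s)
    s = ''.join(c for c in s if not unicodedata.combining(c))
    return s.lower().strip()

print(norm_name(u'Lac Sup\xe9rieur') == norm_name(u'LAC SUPERIEUR'))  # True
```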
In [38]:
# How many matches?
print 'Total number of records in spreadsheet: ', len(res_df)
print 'Of these...'
print 'Number matched by code: ', len(res_df.query('code_present==1'))
print 'Number matched by code with correct lat and lon: ', len(res_df.query('lat_ok==1 & lon_ok==1'))
print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))
The script above uses RESA2 codes taken from John's spreadsheet. For the most part these are the old codes (e.g. US26) associated with the original ICPWaters US project (rather than the more recent codes, such as X15:1434025, used in the trends project). It seems that 58 of these sites have exactly the correct information (name, lat and long), and a further 9 have the correct lat and long but do not use John's suggested names. A further 28 sites can be matched by code, but have incorrect geographic co-ordinates.
In total, there are 95 matches based on site codes. A further 15 sites (the ones marked NA in John's spreadsheet) cannot be matched at all but, of these, only 4 should be included in the core ICPW dataset and only 2 in the trends analysis. These four sites have already been dealt with above (see the start of section 1.2). They are in the database with suitable codes - it's just that these codes aren't given in John's workbook.
What about the sites created as duplicates for the trend analysis project? How many of those can be matched from John's spreadsheet?
In [69]:
# Dict to store results
res_dict = {'nfc_code':[],
            'code_present':[],
            'name_ok':[],
            'lat_ok':[],
            'lon_ok':[]}

# Iterate over spreadsheet
for idx in range(len(us_df)):
    # Get stn id
    nfc_cd = us_df.ix[idx, 'nfc_code']
    name = us_df.ix[idx, 'nfc_name'] # These are John's ideal names
    lat = us_df.ix[idx, 'latitude']
    lon = us_df.ix[idx, 'longitude']

    # Check stations table. Need to add 'X15:' and allow for variants
    q_res = stn_df[(stn_df['station_code']=='X15:%s' % nfc_cd) |
                   (stn_df['station_code']=='X15:%s' % nfc_cd[1:]) |
                   (stn_df['station_code']=='X15:%s' % nfc_cd[:-1])]

    if len(q_res) == 0:
        # The site isn't present with this code
        res_dict['nfc_code'].append(nfc_cd)
        res_dict['code_present'].append(0)
        res_dict['name_ok'].append(np.nan)
        res_dict['lat_ok'].append(np.nan)
        res_dict['lon_ok'].append(np.nan)
    elif len(q_res) == 1:
        # Check the site properties
        res_dict['nfc_code'].append(nfc_cd)
        res_dict['code_present'].append(1)
        res_dict['name_ok'].append(name == q_res.station_name.values[0])
        res_dict['lat_ok'].append(lat == q_res.latitude.values[0])
        res_dict['lon_ok'].append(lon == q_res.longitude.values[0])
    else:
        # This should be impossible
        raise ValueError('Site code %s appears to be duplicated.' % nfc_cd)

# Build df
res_df = pd.DataFrame(res_dict)
res_df = res_df[['nfc_code', 'code_present',
                 'name_ok', 'lat_ok', 'lon_ok']]
res_df.head()
Out[69]:
In [70]:
# How many matches?
print 'Total number of records in spreadsheet: ', len(res_df)
print 'Of these...'
print 'Number matched by code: ', len(res_df.query('code_present==1'))
print 'Number matched by code with correct lat and lon: ', len(res_df.query('lat_ok==1 & lon_ok==1'))
print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))
This approach manages to match most (but not necessarily all) of the US sites using the X15: convention, but note that I had to jump through a few hoops to get here: the code above initially tried to find matches just using the raw NFC code with X15: prepended to it, but this failed in many cases. Two causes (there may be others) are:

- Leading zeros on the NFC site codes have been truncated when added to RESA2. For example, station 01364959 is in the database as X15:1364959.
- Trailing letter codes have been removed (or weren't included originally) when added to RESA2. For example, station 1C1-112E is in the database as X15:1C1-112.

Unfortunately, these kinds of discrepancies make automated cleaning of the database difficult.
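The variant-matching logic could also be pulled into a small helper, which makes the observed discrepancies explicit and easy to extend if further cases turn up. The function name is mine, not anything in RESA2:

```python
def nfc_code_variants(nfc_cd):
    """Candidate RESA2 station codes for an NFC site ID: the raw code
    with 'X15:' prepended, plus versions without a leading zero and
    without a trailing letter (both patterns seen in the database)."""
    variants = ['X15:%s' % nfc_cd]
    if nfc_cd.startswith('0'):
        variants.append('X15:%s' % nfc_cd[1:])    # leading zero dropped
    if nfc_cd[-1].isalpha():
        variants.append('X15:%s' % nfc_cd[:-1])   # trailing letter dropped
    return variants

print(nfc_code_variants('01364959'))   # ['X15:01364959', 'X15:1364959']
print(nfc_code_variants('1C1-112E'))   # ['X15:1C1-112E', 'X15:1C1-112']
```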
A final test for the moment is to see whether I've actually managed to find all the stations that John would like including in the trends analysis.
In [78]:
# Filter dfs
match_df = res_df.query('code_present==1')
trend_df = us_df.query('trend == "YES"')
# Join
trend_df = trend_df.merge(match_df, how='left',
                          on='nfc_code')
# Identify records with no match
trend_df[pd.isnull(trend_df['code_present'])]
Out[78]:
So I've managed to match all the sites for the trends work, with two exceptions. The first is (the increasingly troublesome) Mud Pond, which hasn't matched because when I tidied it up above I changed the site code for station 37063 back to the original US74, rather than the X15: version. The second is Newbert Pond, which is in the database as part of the core ICP Waters programme, but apparently hasn't previously been part of the trends analysis.
This exploration proves that, with a bit of work, I can identify the sites that John would like including in each project and clean up their properties. I haven't really begun to tackle the problem of actually integrating the data, though - that's going to be much more challenging.
It's also worth emphasising that the above was only really possible thanks to John's spreadsheet, so doing the same thing for other countries may be more difficult.