In [4]:
%matplotlib inline
import nivapy3 as nivapy
import mvm_python
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sn
plt.style.use('ggplot')
In [6]:
# Connect to NIVABASE
eng = nivapy.da.connect()
This notebook describes cleaning and updating data for the ~260 "core" ICPW stations (as opposed to the broader network of 430 "trends" stations).
Note that there is substantial overlap between the "core" and "trends" sites, but the data are stored separately in the database: the most recent data for the "trends" work are associated with a project named 'ICPW_TOCTRENDS_2018' (project_id=4390), whereas the "core" stations are split by country into projects named 'ICPWaters CA', 'ICPWaters NO' etc. In theory, where a station is used in both the "core" and "trends" work, the associated chemistry data should be the same, but in practice this is not always the case. Note also that, in 2015, Tore created a set of country-specific "TOC trends" projects, named e.g. 'ICPW_TOCTRENDS_2015_NO'. These include yet more duplicated data, but they should now be considered deprecated, as they have been superseded by 'ICPW_TOCTRENDS_2018'.
The purpose of the work presented here is to (i) quality-assess and upload the latest data for the "core" stations; (ii) review, and where necessary update, the sites used in each country (e.g. changes to monitoring mean that some locations are no longer suitable for ICPW); and (iii) check that the "core" and "trends" datasets are broadly in agreement.
There are 18 Canadian stations in the "core" ICPW programme and I have added the most recent templates to the database.
The code below applies a correction factor of 1.28 to the records for TOC in Nova Scotia collected before March 1995. See e-mail exchanges with Don, John and Heleen on 15/02/2019 for further details.
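As a sanity check on what that correction implies, here is a hypothetical pandas sketch (invented column names; the actual fix below works by switching `method_id` in the database rather than rescaling values directly):

```python
import pandas as pd

# Hypothetical Nova Scotia TOC records (mg/l); dates straddle the cut-off
df = pd.DataFrame({
    'sample_date': pd.to_datetime(['1994-06-01', '1996-06-01']),
    'toc': [5.0, 5.0],
})

# Apply the 1.28 correction only to samples collected before March 1995
mask = df['sample_date'] < '1995-03-01'
df.loc[mask, 'toc'] *= 1.28
```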
In [3]:
## Switch TOC/DOC pre-March 1995 to method with correction factor of 1.28
#with eng.begin() as conn:
# sql = ("UPDATE resa2.water_chemistry_values2 "
# "SET method_id = 10823 "
# "WHERE sample_id IN ( "
# " SELECT water_sample_id "
# " FROM resa2.water_samples "
# " WHERE station_id IN (23461, 23462, 23463, 23464, 23465) "
# " AND sample_date < DATE '1995-03-01') "
# "AND method_id IN (10294, 10273)")
# conn.execute(sql)
Jussi has supplied an entirely new dataset for all Finnish stations covering the period from 1990 to 2017. This supersedes the data already in the database. I have therefore removed all post-1990 data for both "core" and "TOC_TRENDS_2015" stations, and replaced the values with the most recent dataset. I have also updated the station properties (NFC codes etc.) based on the latest information provided.
One of the "core" Finnish stations, FI13, is no longer monitored. Jussi has proposed that this station be removed from the "core" and replaced with Nat_FI_27. This is a nearby station with similar characteristics and ongoing monitoring. It is also one of the "trends" stations, with code 'Nat_FI_27' in the 2015 trends dataset and code 'Tr18_FI_39478' in the latest trends work.
Based on this, I have removed FI13 from the "core" and substituted Nat_FI_27 instead.
The Czech data was reviewed and updated during 2016. The latest data provided during 2017 has now also been added.
It looks as though there has been a long-running confusion regarding site names and codes for the German stations. As far as I can make out, the following changes are required:
- Station DE34, which is currently named "Odenwald, Schmerbach 1" in our database, should actually be named "Odenwald, Rombach 4"
- Station DE35, which is currently named "Taunus, Rombach 4" in our database, should actually be labelled "DE36" and named "Odenwald, Schmerbach 1"
- Station DE11, which is currently named "Schwarzwald, Kleine Kinzig" in our database, should actually be labelled "DE35" and named "Schwarzwald, Kleine Kinzig Huettenhardt"
I have made these changes, as well as updating the station properties with the NFC codes provided by Jens. Hopefully this will help to avoid confusion in the future.
Note: Changing site names/codes like this is quite dangerous, as there is a good chance of data becoming associated with the wrong station. It might be worth getting Jens to check the data we have after this clean-up, or perhaps submitting an entirely new dataset if that's not too much work.
Other points to note:
- Monitoring has stopped at several of the German stations, and we also seem to have a gap in data submission between 2013 and 2016 (giving an actual data gap from the end of 2012 to the end of 2014)
- For the lakes, Jens has provided data from a wide variety of depths. We are primarily interested in surface samples, but restricting the data to just depth = 0 results in a very limited dataset. As with some other countries, I have therefore filtered the data to include samples (including "mixed samples") from less than or equal to 2 m depth.
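A minimal sketch of such a depth filter, using the `depth1` naming convention that appears later in this notebook (the frame itself is invented):

```python
import pandas as pd

# Invented lake samples at a range of depths (m)
df = pd.DataFrame({
    'station': ['DE01', 'DE01', 'DE01'],
    'depth1': [0.0, 1.5, 5.0],
    'toc': [4.2, 4.3, 3.9],
})

# Keep surface and near-surface samples only (<= 2 m)
surface = df.query('depth1 <= 2')
```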
Data tidied by Cathrine - see e-mail received 18.02.2019.
Data tidied by Cathrine - see e-mail received 18.02.2019. I have created two new stations (MD01 and MD02) and associated them with a new project ('ICPWaters MD').
Data tidied by Espen - see e-mail received 18.02.2019.
Data tidied by Espen - see e-mail received 18.02.2019. I have created six new stations (IT07 to IT12) and added them to the existing 'ICPWaters IT' project.
Data tidied by Espen - see e-mail received 18.02.2019.
The latest Swedish data are available from the MVM database. For consistency, it seems sensible to update the entire data series using the API, thereby ensuring that NIVA has the most recent quality assured data. The file
.\ICP_Waters\Call_For_data_2018\replies\sweden\sweden_core_stations_mvm_codes_feb_2019.xlsx
lists the 22 Swedish stations in the "core" programme. I have deleted all the data associated with these stations. The code below then uses my MVM_Python package to download updated series from the Swedish database and add them to RESA.
In [3]:
# Get a list of the Swedish sites
stn_xlsx = (r'../../../Call_For_data_2018/replies'
r'/sweden/sweden_core_stations_mvm_codes_feb_2019.xlsx')
swe_df = pd.read_excel(stn_xlsx)
print(len(swe_df))
swe_df.head()
Out[3]:
In [4]:
# Lookup table matching MVM pars to RESA methods
par_map_xlsx = (r'../../../Sweden_MVM_API/map_mvm_pars_to_resa2.xlsx')
par_map_df = pd.read_excel(par_map_xlsx)
par_map_df
Out[4]:
In [5]:
def f(row):
    """Return '<' or '>' when the raw value string carries a
    less-than/greater-than flag, otherwise NaN.
    """
if '<' in row['value_']:
val = '<'
elif '>' in row['value_']:
val = '>'
else:
val = np.nan
return val
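For reference, the same flag extraction can be done in a vectorized way (a sketch; the notebook itself applies `f` row-wise with `apply`):

```python
import pandas as pd

# Example raw value strings, some carrying LOD flags
vals = pd.Series(['<0.05', '1.2', '>10'])

# Extract a '<' or '>' flag where present; unflagged rows become NaN
flags = vals.str.extract(r'([<>])', expand=False)
```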
In [ ]:
# # Needed for the VARCHAR dtype mapping used further down
# from sqlalchemy import types
# # Path to local file containing MVM access tokens
# token_xlsx = r'../../../Sweden_MVM_API/md_mvm_tokens.xlsx'
# # Time period of interest
# st_yr = 1970
# end_yr = 2016
# # Loop over stations
# for idx, row in list(swe_df.iterrows()):
# # Get IDs
# resa_id = row['station_id']
# mvm_id = row['nfc_code']
# name = row['station_name']
# print('Processing: %s' % name)
# print (' Getting data from MVM...')
# # Get data from MVM
# mvm_df = mvm_python.query_mvm_station_data(mvm_id,
# st_yr,
# end_yr,
# token_xlsx)
# print (' Restructuring...')
# # Get pars of interest
# mvm_df = pd.merge(mvm_df, par_map_df,
# how='inner',
# left_on='par_name',
# right_on='mvm_par')
# # Convert units
# mvm_df['value'] = mvm_df['value'] * mvm_df['factor']
# # If multiple depths are available, get the shallowest
# mvm_df.sort_values(by=['depth1', 'depth2'], inplace=True)
# mvm_df.drop_duplicates(subset=['mvm_id', 'sample_date', 'par_name'],
# inplace=True)
# # Get just cols of interest
# mvm_df = mvm_df[['sample_date', 'icpw_method_id', 'value']]
# # Occasionally, there are still some duplicates (e.g. Tot-N_ps and Tot-N_TNb)
# # at the same site-date. Average these for now
# mvm_df = mvm_df.groupby(['sample_date',
# 'icpw_method_id']).agg({'value':'mean'}).reset_index()
# # Sometimes values of 0 are reported. Drop these as dubious
# mvm_df['value'].replace(0, np.nan, inplace=True)
# mvm_df.dropna(subset=['value'], inplace=True)
# # Make sure the df can be pivoted (integrity check - don't actually need
# # to pivot, so result is not saved as variable)
# mvm_df.pivot(index='sample_date', columns='icpw_method_id')
# # Build water samp df
# ws_df = mvm_df[['sample_date']].copy()
# ws_df['station_id'] = resa_id
# ws_df['depth1'] = 0
# ws_df['depth2'] = 0
# ws_df.drop_duplicates(subset=['sample_date'], inplace=True)
# ws_df.sort_values(by='sample_date', inplace=True)
# ws_df = ws_df[['station_id', 'sample_date', 'depth1', 'depth2']]
# # Add water samples to db
# print (' Writing data to WATER_SAMPLES table...')
# dtypes = {c:types.VARCHAR(ws_df[c].str.len().max())
# for c in ws_df.columns[ws_df.dtypes == 'object'].tolist()}
# ws_df.to_sql(name='water_samples',
# schema='resa2',
# con=eng,
# if_exists='append',
# index=False,
# dtype=dtypes)
# # Get sample_ids back from db
# print (' Identifying sample IDs...')
# sql = ('SELECT water_sample_id, station_id, sample_date '
# 'FROM resa2.water_samples '
# 'WHERE station_id = %s' % resa_id)
# ws_df = pd.read_sql_query(sql, eng)
# print (' Checking data integrity...')
# # Join sample id to chemistry
# chem_df = pd.merge(mvm_df, ws_df, how='left', on='sample_date')
# # Get cols of interest
# chem_df = chem_df[['water_sample_id', 'icpw_method_id', 'value']]
# chem_df.columns = ['sample_id', 'method_id', 'value_']
# # Drop NaNs
# chem_df.dropna(how='any', inplace=True)
# # Deal with flags
# chem_df['value_'] = chem_df['value_'].astype(str)
# chem_df['flag1'] = chem_df.apply(f, axis=1)
# # Extract numeric chars
# chem_df['value'] = chem_df['value_'].str.extract("([-+]?\d*\.\d+|\d+)", expand=True)
# chem_df['value'] = chem_df['value'].astype(float)
# del chem_df['value_']
# # Reorder cols
# chem_df = chem_df[['sample_id', 'method_id', 'value', 'flag1']]
# # Check flags are consistent
# if not pd.isnull(chem_df['flag1']).all():
# if not set(chem_df['flag1'].unique()).issubset(['<', '>', np.nan]):
# print ('Some flags are not valid:')
# print (chem_df['flag1'].unique())
# # Add chem to db
# print (' Writing data to WATER_CHEMISTRY_VALUES2 table...')
# dtypes = {c:types.VARCHAR(chem_df[c].str.len().max())
# for c in chem_df.columns[chem_df.dtypes == 'object'].tolist()}
# chem_df.to_sql(name='water_chemistry_values2',
# schema='resa2',
# con=eng,
# if_exists='append',
# index=False,
# dtype=dtypes)
# print (' Done.')
As described in the trends notebook, John has supplied an entirely new set of data for the US, but some of the NFC codes have changed. I have therefore updated the codes in our database to reflect these changes. The latest station codes for the 95 "core" US sites are given in
.\ICP_Waters\Call_For_data_2018\usa_trends_and_core_sites.xlsx
Note that John's most recent dataset does not include any data for 4 of the "core" stations (US82, US94, US125 and X15:1C1-106), because these are no longer active.
I have deleted all the old data associated with the US sites, except for the 4 stations listed above. The code below reads John's new files and creates an Excel worksheet in the same format as the ICPW template, which can then be added back to the database.
In [6]:
# Get list of US "core" stations
stn_xlsx = r'../../../Call_For_data_2018/usa_trends_and_core_sites.xlsx'
stn_df = pd.read_excel(stn_xlsx, sheet_name='Stations')
us_cds = list(stn_df['nfc_code'].unique())
# Exclude inactive stations ['US82', 'US94', 'US125', 'X15:1C1-106']
inactive = ['ME-9998E', '1C1-092', '1C1-096', '1C1-106E']
us_cds = [i for i in us_cds if i not in inactive]
len(us_cds) # Should be 91
Out[6]:
In [7]:
# Read stream data
riv_xl_path = r'../../../Call_for_Data_2018/replies/usa/Stoddard LTM streams_2018.xls'
riv_df = pd.read_excel(riv_xl_path, sheet_name='LTM stream data 1979-2016')
# Tidy
del riv_df['PROGRAM'], riv_df['YR']
del riv_df['ALDS'], riv_df['ALOR']
del riv_df['DIC'], riv_df['FLOW']
# Rename
riv_df.rename({'SITE_ID':'Code',
'DATE_TIME':'Date',
'ALTD':'TAL',
'ANC':'ALKGMIKRO',
'CA':'Ca',
'CL':'Cl',
'COND':'K25X',
'DOC':'TOC',
'MG':'Mg',
'NA':'Na',
'NH4':'NH4N',
'NO3':'NO3N',
'PH':'pH',
'SIO2':'SiO2',
'SO4':'SULF',
'WTEMP':'TEMP'},
inplace=True,
axis=1)
# Convert units
riv_df['Ca'] = riv_df['Ca']*40.1/(1000*2) # ueq/l to mg/l
riv_df['Cl'] = riv_df['Cl']*35.45/(1000*1) # ueq/l to mg/l
riv_df['K'] = riv_df['K']*39.1/(1000*1) # ueq/l to mg/l
riv_df['Mg'] = riv_df['Mg']*24.3/(1000*2) # ueq/l to mg/l
riv_df['Na'] = riv_df['Na']*23./(1000*1) # ueq/l to mg/l
riv_df['NH4N'] = riv_df['NH4N']*14. # ueq/l to ugN/l
riv_df['NO3N'] = riv_df['NO3N']*14. # ueq/l to ugN/l
riv_df['SULF'] = riv_df['SULF']*96.065/(1000*2) # ueq/l to mgSO4/l
riv_df.head()
Out[7]:
In [8]:
# Read lake data
la_xl_path = r'../../../Call_for_Data_2018/replies/usa/Stoddard LTM lakes_2018.xls'
la_df = pd.read_excel(la_xl_path, sheet_name='LTM lake data 1980-2016')
# Fill NaN in DEPTH column with 0
la_df['DEPTH'] = la_df['DEPTH'].fillna(0)
# Get surface only
la_df = la_df.query("DEPTH <= 1")
# Tidy
del la_df['PROGRAM'], la_df['SAMTYPE']
del la_df['year'], la_df['month'], la_df['day']
del la_df['ALDS'], la_df['ALOR'], la_df['CHLA']
del la_df['DIC'], la_df['PHEQ'], la_df['DEPTH']
# Rename
la_df.rename({'SITE_ID':'Code',
'DATE_TIME':'Date',
'AL':'TAL',
'ANC':'ALKGMIKRO',
'CA':'Ca',
'CL':'Cl',
'COLOR':'COLOUR',
'COND':'K25X',
'DOC':'TOC',
'MG':'Mg',
'NA':'Na',
'NH4':'NH4N',
'NO3':'NO3N',
'PH':'pH',
'PTL':'TOTP',
'SIO2':'SiO2',
'SO4':'SULF',
'WTEMP':'TEMP'},
inplace=True,
axis=1)
# Convert units
la_df['Ca'] = la_df['Ca']*40.1/(1000*2) # ueq/l to mg/l
la_df['Cl'] = la_df['Cl']*35.45/(1000*1) # ueq/l to mg/l
la_df['K'] = la_df['K']*39.1/(1000*1) # ueq/l to mg/l
la_df['Mg'] = la_df['Mg']*24.3/(1000*2) # ueq/l to mg/l
la_df['Na'] = la_df['Na']*23./(1000*1) # ueq/l to mg/l
la_df['NH4N'] = la_df['NH4N']*14. # ueq/l to ugN/l
la_df['NO3N'] = la_df['NO3N']*14. # ueq/l to ugN/l
la_df['SULF'] = la_df['SULF']*96.065/(1000*2) # ueq/l to mgSO4/l
la_df.head()
Out[8]:
In [9]:
# Combine river and lake data
us_df = pd.concat([riv_df, la_df], axis=0, sort=False)
# Filter to just sites in core project
us_df = us_df[us_df['Code'].isin(us_cds)]
# Join ICPW codes
us_df = pd.merge(us_df, stn_df[['station_code', 'nfc_code']],
left_on='Code', right_on='nfc_code')
# Average duplicates
us_df = us_df.groupby(['station_code', 'Date']).mean()
# Tidy
us_df.reset_index(inplace=True)
# Check we have data for all sites
assert len(us_df['station_code'].unique()) == len(us_cds), 'Some stations have no data.'
# Rename
us_df.rename({'station_code':'Code'},
inplace=True,
axis=1)
# Save to Excel
out_xl = r'../../../Call_for_Data_2018/replies/usa/usa_lake_stream_all_core_to_2016_tidied.xlsx'
us_df.to_excel(out_xl, sheet_name='Data', index=False)
us_df.head()
Out[9]:
In [ ]:
# # Delete existing data for 91 stations
# # Get station IDs
# sql = ("SELECT station_id FROM resa2.stations "
# "WHERE station_code IN %s" % str(tuple(list(us_df['Code'].unique()))))
# id_df = pd.read_sql(sql, eng)
# # Delete from SAMPLE_SELECTIONS
# print ('Deleting from SAMPLE_SELECTIONS')
# sql = ("DELETE FROM resa2.sample_selections "
# "WHERE water_sample_id IN ( "
# " SELECT water_sample_id FROM resa2.water_samples "
# " WHERE station_id IN %s)" % str(tuple(list(id_df['station_id'].unique()))))
# eng.execute(sql)
# # Delete from WCV2
# print ('Deleting from WATER_CHEMISTRY_VALUES2')
# sql = ("DELETE FROM resa2.water_chemistry_values2 "
# "WHERE sample_id IN ( "
# " SELECT water_sample_id FROM resa2.water_samples "
# " WHERE station_id IN %s)" % str(tuple(list(id_df['station_id'].unique()))))
# eng.execute(sql)
# # Delete from WATER_SAMPLES
# print ('Deleting from WATER_SAMPLES')
# sql = ("DELETE FROM resa2.water_samples "
# "WHERE station_id IN %s" % str(tuple(list(id_df['station_id'].unique()))))
# eng.execute(sql)
There are 6 UK stations within the "core" programme. As described in the "trends" notebook, some issues have been identified with the historic data already in the database. I will therefore delete all the old data and upload corrected values compatible with the most recent trends work. For the UK, the "core" stations are a subset of the "trends" stations, so it's just a question of uploading a cut-down version of the cleaned "trends" dataset. I have created a new Excel file for just the "core" stations here
.\ICP_Waters\Call_For_data_2018\replies\uk\icpw_uk_core_all.xlsx
This has been uploaded to the database.
In [ ]:
# # Delete existing data for 6 UK stations
# # Delete from SAMPLE_SELECTIONS
# print ('Deleting from SAMPLE_SELECTIONS')
# sql = ("DELETE FROM resa2.sample_selections "
# "WHERE water_sample_id IN ( "
# " SELECT water_sample_id FROM resa2.water_samples "
# " WHERE station_id IN (23608, 23609, 23610, 23611, 23612, 36165))")
# eng.execute(sql)
# # Delete from WCV2
# print ('Deleting from WATER_CHEMISTRY_VALUES2')
# sql = ("DELETE FROM resa2.water_chemistry_values2 "
# "WHERE sample_id IN ( "
# " SELECT water_sample_id FROM resa2.water_samples "
# " WHERE station_id IN (23608, 23609, 23610, 23611, 23612, 36165))")
# eng.execute(sql)
# # Delete from WATER_SAMPLES
# print ('Deleting from WATER_SAMPLES')
# sql = ("DELETE FROM resa2.water_samples "
# "WHERE station_id IN (23608, 23609, 23610, 23611, 23612, 36165)")
# eng.execute(sql)
The existing "core" project includes 10 stations in Ireland, but 7 of these (all the river sites) are no longer monitored. Julian has proposed a further 11 lake sites as replacements, giving 21 Irish sites in total, of which only 14 have active monitoring.
All the station locations and metadata for the Irish sites are given in
.\ICP_Waters\Call_for_Data_2017\Replies\ireland\ireland_20170929.xlsx
Note that for many of the sites there are several possible monitoring points (e.g. on the lake shore or out in open water), but all of them are given the same geographic co-ordinates in Julian's spreadsheet (i.e. a "station" comprises several nearby monitoring "points"). For ICPW, we have historically focused on the open water sites, so I have filtered the shoreline samples out from Julian's dataset. In some cases, there are several open water sampling points per station as well. This leads to duplicates, because the same site has been monitored more than once on the same date, but at different points. These duplicated values are usually very similar, and differences will therefore be ignored.
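The two steps described above could be sketched as follows (column names are hypothetical, and the averaging is shown purely as one way of collapsing the duplicates):

```python
import pandas as pd

# Hypothetical subset of the Irish data: several monitoring points per station
df = pd.DataFrame({
    'station': ['IE15', 'IE15', 'IE15'],
    'point_type': ['shoreline', 'open water', 'open water'],
    'sample_date': pd.to_datetime(['2015-06-01'] * 3),
    'toc': [6.1, 5.8, 5.9],
})

# 1. Keep only the open water monitoring points
df = df[df['point_type'] == 'open water']

# 2. Collapse duplicate same-day samples (here by averaging)
df = (df.groupby(['station', 'sample_date'], as_index=False)
        .agg({'toc': 'mean'}))
```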
I have created 11 new sites in the database based on the details in the Excel file above. The Excel file below contains the aggregated data supplied by Julian for the period from 2007 to 2017
.\ICP_Waters\Call_For_data_2018\replies\ireland\ireland_icpw_2007-2017_jes_tidied.xls
Note: While adding the data to the database, I have noticed that historic values (i.e. those already in the database) for alkalinity look very high. From a brief check, I suspect many values are a factor of 1000 too large. This will need investigating further before undertaking the trends analysis - see section 2.6, below.
There are currently four Polish stations within the "core" programme (PL01 to PL04), but these have not been monitored since 2012 (see e-mail from Rafal received 17.06.2017 at 17.43). Instead, Rafal has suggested five new stations and supplied data for these from 1999 to 2017. Details for the new sites are given in
.\ICP_Waters\Call_for_Data_2016\Replies\poland\poland.xlsx
I have therefore created new stations with codes PL05 to PL09, together with a tidied data sheet here
.\ICP_Waters\Call_For_data_2018\replies\poland\icpw_poland_post-2012_jes_tidied.xls
which has been added to the database.
There are clearly some issues with the raw data in the database. Before going further with the trends analysis, I'd like to try to catch as many of the data errors as possible. For now, I'll focus on the main parameters of interest for the trends report: SO4, base cations (Ca and Mg), NO3, Alkalinity, ANC (calculated from Ca, Mg, K, Na, NH4, Cl, SO4 and NO3), DOC/TOC and pH.
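For reference, ANC here is the charge-balance definition: sum of base cations minus sum of strong acid anions, in ueq/l. Below is a sketch using the unit conventions adopted elsewhere in this notebook (mg/l for the major ions, ug N/l for NH4-N and NO3-N); the molar masses are standard values, but the function itself is illustrative rather than the database's method:

```python
def anc_ueq_per_l(ca, mg, k, na, nh4n, cl, so4, no3n):
    """Charge-balance ANC in ueq/l.

    Ca, Mg, K, Na, Cl and SO4 in mg/l; NH4-N and NO3-N in ug N/l.
    """
    # mg/l -> ueq/l: value / molar_mass * charge * 1000
    cations = (ca * 2000 / 40.08 +   # Ca2+
               mg * 2000 / 24.31 +   # Mg2+
               k * 1000 / 39.10 +    # K+
               na * 1000 / 22.99 +   # Na+
               nh4n / 14.01)         # NH4+ (ug N/l -> ueq/l)
    anions = (cl * 1000 / 35.45 +    # Cl-
              so4 * 2000 / 96.06 +   # SO42-
              no3n / 14.01)          # NO3- (ug N/l -> ueq/l)
    return cations - anions
```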
In [10]:
# Select projects
prj_grid = nivapy.da.select_resa_projects(eng)
prj_grid
In [11]:
# Select all relevant projects
prj_df = prj_grid.get_selected_df()
print(len(prj_df), 'projects selected.')
prj_df
Out[11]:
Note: There are a total of 25 countries that have submitted data to ICPW in the past, but only the 15 above are suitable for the trends analysis. For reference, the others are:
Country | Comment |
---|---|
Armenia | Data only from 2004 - 2008. Does not use template. No response to recent calls for data |
Austria | Only 1 station. No data since 2012 |
Belarus | Only 1 station. No response to recent calls for data |
France | No longer involved in project. Never supplied data? |
Hungary | No longer involved in project. Never supplied data? |
Montenegro | Data only from 2006 - 2009. Does not use template. No response to recent calls for data |
Netherlands | Currently being processed by Cathrine |
Russia | Data only from 2009 - 2014. Does not use template. No response to recent calls for data |
Slovakia | Supplied as part of "trends" work. Could also be used in "core", but some overlap with Polish sites? |
Spain | No data supplied. No response to recent calls for data |
Of these, it should eventually be possible to include data from the Netherlands, and maybe Slovakia too.
In [12]:
# Get stations for these projects
stn_df = nivapy.da.select_resa_project_stations(prj_df, eng)
print(len(stn_df), 'stations associated with the selected projects.')
stn_df.head()
Out[12]:
In [13]:
# Map
nivapy.spatial.quickmap(stn_df,
popup='station_code',
aerial_imagery=True)
Out[13]:
In [14]:
# Get available parameters
st_dt = '1970-01-01'
end_dt = '2019-01-01'
par_grid = nivapy.da.select_resa_station_parameters(stn_df,
st_dt,
end_dt,
eng)
par_grid
In [15]:
# Select relevant parameters
par_df = par_grid.get_selected_df()
par_df
Out[15]:
In [16]:
# Get chem
wc_df, dup_df = nivapy.da.select_resa_water_chemistry(stn_df,
par_df,
st_dt,
end_dt,
eng,
drop_dups=True,
lod_flags=False)
wc_df.head()
Out[16]:
In [17]:
# Save for speed
wc_df.to_csv('working_chem.csv', index=False, encoding='utf-8')
In [18]:
# Load saved data
wc_df = pd.read_csv('working_chem.csv', encoding='utf-8')
wc_df['sample_date'] = pd.to_datetime(wc_df['sample_date'])
In [19]:
# Which stations have duplicates?
dup_df['station_code'].unique()
Out[19]:
There are duplicated values for stations in Norway, Canada, Germany, Estonia and Latvia. The duplicates in Norway are due to the way the NIVA lab handles reanalyses, and my code is designed to query these correctly. For the other countries, the problem is due to submitting both DOC and TOC data for the same water sample. These are added to our database under separate methods, but for ICPW both TOC and DOC get mapped to TOC in the output, because ICPW does not distinguish between the two. My code will currently pick one of the duplicates at random, which is not ideal. However, as long as the ICPW assumption that TOC ~ DOC is correct, this shouldn't actually cause problems. (And if this assumption is not correct, we have broader issues to deal with in the ICPW database).
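If deterministic behaviour were preferred, one option (a sketch with hypothetical column names, not what the current query code does) is to rank the methods explicitly before dropping duplicates:

```python
import pandas as pd

# Hypothetical duplicates: the same water sample reported under both
# TOC and DOC methods, which both map to 'TOC' in the ICPW output
df = pd.DataFrame({
    'sample_id': [1, 1, 2],
    'source_par': ['TOC', 'DOC', 'TOC'],
    'value': [5.1, 5.0, 4.2],
})

# Prefer TOC over DOC instead of keeping an arbitrary duplicate
df['pref'] = df['source_par'].map({'TOC': 0, 'DOC': 1})
df = (df.sort_values(['sample_id', 'pref'])
        .drop_duplicates(subset='sample_id')
        .drop(columns='pref'))
```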
From above, there are 261 stations associated with the selected projects. Do they all have at least some relevant data?
In [20]:
print(len(wc_df['station_id'].unique()), 'stations with data.')
In [21]:
# Identify missing station
set(stn_df['station_id']) - set(wc_df['station_id'])
Out[21]:
In [22]:
stn_df.query('station_id == 38562')
Out[22]:
So, the only station with no data is IE20 - one of the new Irish lakes. I have checked Julian's original data submission and this looks OK: the only data he supplied for IE20 is from the shoreline, and these samples have been deliberately excluded - see section 1.13, above.
The code below calculates the number of water samples per year per station. A bar plot is produced for each country, where the height of each bar shows the average number of samples per year and the error bars show the minimum and maximum. For example, in Norway during 1973, the minimum number of samples was 146 (at Langtjern) and the maximum number was 368 (at Birkenes).
In [23]:
# Get countries from proj names
sql = ("SELECT station_id, "
" SUBSTR(b.project_name, -2, 2) AS country "
"FROM resa2.projects_stations a "
"LEFT JOIN resa2.projects b "
"ON a.project_id = b.project_id "
"WHERE station_id IN %s "
"AND SUBSTR(b.project_name, 1, 9) = 'ICPWaters'"% str(tuple(list(stn_df['station_id']))))
cnt_df = pd.read_sql(sql, eng)
# Join to chem
wc_df = pd.merge(wc_df, cnt_df, on='station_id')
In [24]:
# Extract year
wc_df['year'] = wc_df['sample_date'].dt.year
# Number of samples per year per site
agg = wc_df.groupby(['station_id', 'country', 'year']).agg({'station_code':'count'}).reset_index()
agg.columns = ['station_id', 'country', 'year', 'count']
# Plot
g = sn.catplot(data=agg,
x='year',
y='count',
row='country',
kind='bar',
ci=100,
sharey=False,
aspect=5)
for ax in g.axes:
ax[0].xaxis.set_tick_params(which='both', labelbottom=True)
plt.tight_layout()
The plots above illustrate the wide variability in monitoring across the ICPW programme. Some points to note:
- The Norwegian sites have been monitored very consistently since the 1980s (about one sample per week), as have some of the Canadian sites. It may be possible to try more sophisticated/detailed trend analyses using these high-frequency data
- Germany, Poland, Sweden and the USA (plus some Canadian sites) typically have monthly monitoring, i.e. still quite detailed
- Switzerland, Estonia, Finland, Ireland, Italy, Latvia, the UK and the Czech Republic typically have fewer than 12 samples per year. These data are less detailed, and probably only suitable for calculating annual averages (rather than attempting to capture e.g. seasonal effects)
- The dataset for Moldova currently only covers the period from 2014 to 2017. It is therefore not yet suitable for estimating trends
In [25]:
# Melt to long format
wc_lng = wc_df.melt(id_vars=['station_id', 'station_code',
'station_name', 'country',
'sample_date', 'year',
'depth1', 'depth2'],
var_name='param')
# Plot
g = sn.catplot(data=wc_lng,
x='country',
y='value',
row='param',
kind='box',
aspect=4,
sharey=False)
for ax in g.axes:
ax[0].xaxis.set_tick_params(which='both', labelbottom=True)
plt.tight_layout()
Note the following:
- There are clearly issues with the alkalinity data for the Irish and Italian stations. As mentioned in Section 1, I suspect there may be unit errors in some of the old templates (the recent ones look OK), which have resulted in values being uploaded that are a factor of 1000 too large. Needs further investigation
- Alkalinity for the Norwegian sites (from NIVA's lab) is reported in mmol/l. These values need converting to ueq/l in order to be compatible with data from the Focal Centres
- The data from Moldova are very different to those from the other ICPW sites. I don't know much about Moldova, but I guess the values are plausible if the region is strongly influenced by marine processes and/or limestone geology? These sites don't have enough data for inclusion in the trends work, but check the values look reasonable with Øyvind nevertheless
- There are a few very high Cl concentrations in some of the German and Italian samples. Are these values plausible, or likely contaminated?
- There are very high ammonium concentrations at some of the German sites, and very high nitrate concentrations in both Germany and Latvia. The nitrate values are plausible for heavily impacted (e.g. agricultural) systems, but they look a bit out of place within ICPW. Are these values reasonable? Some of the samples have NO3-N concentrations well above the drinking water limit specified by the Nitrates Directive (11.3 mg-N/l), which seems pretty high for supposedly "pristine" catchments
- There are clearly issues with the TOC values from some of the US stations. This problem was previously identified during data preparation for the "TOC trends" paper and John is following it up (see e-mail from John received 15.02.2019 at 21.54 for details)
- Some pH values from Latvia and Germany are very low. Did acidification really cause pH values as low as 3?
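The Norwegian alkalinity conversion mentioned above is a straight factor of 1000 on an equivalence basis (1 mmol/l of charge corresponds to 1000 ueq/l); as a trivial sketch:

```python
def alk_mmol_to_ueq(alk_mmol):
    """Convert alkalinity from mmol/l (charge equivalents) to ueq/l."""
    return alk_mmol * 1000.0
```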
Checking of the Irish data shows that, in the past, alkalinity has been supplied both as meq/l and as ueq/l. It seems pretty obvious that errors have been made regarding the units in some cases, such that values in ueq/l have been reported as meq/l, which results in an incorrect multiplier of 1000 being applied. Looking at the distribution of alkalinity values for the Irish sites, the following SQL can be used to identify and fix the problem (by switching the method from meq/l to ueq/l).
UPDATE RESA2.WATER_CHEMISTRY_VALUES2
SET method_id = 10298
WHERE sample_id IN
(SELECT water_sample_id
FROM RESA2.WATER_SAMPLES
WHERE station_id IN
( SELECT station_id FROM RESA2.PROJECTS_STATIONS WHERE project_id = 3445
)
)
AND method_id = 10297
AND value > 2;
Similarly, during 2010 to 2012, it looks as though alkalinity values in ueq/l for sites IT01 to IT06 have been mistakenly entered into the database as meq/l. This has been fixed using the following SQL
UPDATE RESA2.WATER_CHEMISTRY_VALUES2
SET method_id = 10298
WHERE sample_id IN
(SELECT water_sample_id
FROM RESA2.WATER_SAMPLES
WHERE station_id IN
( SELECT station_id FROM RESA2.PROJECTS_STATIONS WHERE project_id = 2987
)
)
AND method_id = 10297
AND value > 10;
In [7]:
## Apply factor of 1/83.26 to DOC
#with eng.begin() as conn:
# sql = ("UPDATE resa2.water_chemistry_values2 "
# "SET value = value/83.26 "
# "WHERE sample_id IN ( "
# " SELECT water_sample_id "
# " FROM resa2.water_samples "
# " WHERE station_id IN (23658, 23659, 23660, 23661) "
# " AND sample_date > DATE '2014-10-01') "
# "AND method_id = 10273")
# conn.execute(sql)
In [8]:
## Apply factor of 26.98 to Al
#with eng.begin() as conn:
# sql = ("UPDATE resa2.water_chemistry_values2 "
# "SET value = value*26.98 "
# "WHERE sample_id IN ( "
# " SELECT water_sample_id "
# " FROM resa2.water_samples "
# " WHERE station_id IN (23658, 23659, 23660, 23661) "
# " AND sample_date > DATE '2014-10-01') "
# "AND method_id = 10249")
# conn.execute(sql)