In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import nivapy3 as nivapy
import seaborn as sn
import matplotlib.pyplot as plt
import warnings
import toc_trends_analysis as toc_trends
#warnings.filterwarnings("ignore")
plt.style.use('ggplot')
In [2]:
# Connect to NIVABASE
eng = nivapy.da.connect()
Øyvind's outline for the Thematic Trends Report includes the following:
Trends in surface water chemistry 1990-2016
a. General overview of trend analysis (spaghetti plots?)
b. Time series from sites sampled frequently (at least monthly)
i. Trends in annual means
ii. Trends in severity of episodes
iii. Trends in climate
iv. Search for breaking points (periods of significant change)
c. Regional trends (all sites)
In addition, it has been decided to combine data from the "core" ICPW dataset with that from the expanded "trends" work. This is not straightforward, as there is significant overlap between the two datasets, and in most cases we have multiple time series for the same site. I would like to move towards using a single "definitive" series for each station, but am reluctant to delete the duplicated series without more thorough checking. In the medium term, I will preferentially use data from the "trends" stations where possible (these data have been more thoroughly checked), and will only use the "core" series for stations that are not part of the trends work.
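For illustration, the preference logic described above could eventually be implemented along the following lines. This is a minimal, hypothetical sketch: series_df and its 'source' column are invented for illustration and are not objects defined in this notebook.

# Hypothetical sketch: keep one series per station, preferring 'trends' over 'core'
# 'series_df' (with columns 'station_id' and 'source') is assumed for illustration
pref = {'trends': 0, 'core': 1}
dedup_df = (series_df.assign(rank=series_df['source'].map(pref))
                     .sort_values('rank')
                     .drop_duplicates(subset='station_id', keep='first')
                     .drop(columns='rank'))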
As a starting point, I have generated a single Excel file listing all the ICPW stations for which we have data of reasonable quality. This dataset comprises all the "trends" locations, plus stations from 16 countries in the "core" dataset (the 15 countries listed here, plus the Netherlands). For the present, this can be taken as a definitive list of stations within the ICPW project.
In [3]:
# Read stations
stn_path = r'../../../all_icpw_sites_may_2019.xlsx'
stn_df = pd.read_excel(stn_path, sheet_name='all_icpw_stns')
# Check station numbers seem OK
trend_df = stn_df.query("group in ('Trends', 'Trends+Core')")
core_df = stn_df.query("group in ('Core', 'Trends+Core')")
print(f'There are {len(stn_df)} unique stations within the ICPW project as a whole.')
print(f'There are {len(core_df)} stations within the ICPW "core" project.')
print(f'There are {len(trend_df)} stations within the ICPW "trends" project.')
print('')
stn_df.head()
Out[3]:
In [4]:
# Map
nivapy.spatial.quickmap(stn_df,
cluster=True,
popup='station_code',
aerial_imagery=True)
Out[4]:
For now, I'll consider the same raw water chemistry parameters as here, but excluding alkalinity.
In [5]:
# User options
st_dt = '1990-01-01'
end_dt = '2016-12-31'
# Pars of interest
pars = ['TOC', 'EH', 'ECl', 'ESO4', 'ESO4X', 'ENO3', 'ECa_EMg', 'ECaX_EMgX', 'ANC', 'ANCX', 'ALK-E']
par_list = [1, 5, 6, 7, 8, 11, 12, 13, 14, 49, 65] # All raw pars required to calc those above
# Get chem
wc_df, dup_df = nivapy.da.select_resa_water_chemistry(stn_df,
par_list,
st_dt,
end_dt,
eng,
drop_dups=True,
lod_flags=False)
wc_df.head()
Out[5]:
In [6]:
# Some concentrations are less than 0, which doesn't make sense
# Set back to zero
cols = ['Ca_mg/l', 'Cl_mg/l', 'K_mg/l', 'Mg_mg/l', 'NH4-N_µg/l N',
'NO3-N_µg/l N', 'Na_mg/l', 'SO4_mg/l', 'TOC_mg C/l', 'pH_']
for col in cols:
    wc_df.loc[wc_df[col] < 0, col] = 0
In [7]:
# Remove units from col names
idx_cols = ['station_id', 'station_code', 'station_name', 'sample_date', 'depth1', 'depth2']
wc_df.columns = idx_cols + [i.split('_')[0] for i in wc_df.columns if i not in idx_cols]
# Values in the raw data equal to exactly zero must be wrong
for col in wc_df.columns:
if col not in idx_cols:
wc_df[col] = wc_df[col].replace(0, np.nan)
# Convert units and apply sea-salt corrections etc.
wc_df = toc_trends.conv_units_and_correct(wc_df)
wc_df.set_index(idx_cols, inplace=True)
wc_df = wc_df[pars]
wc_df.reset_index(inplace=True)
wc_df.head()
Out[7]:
For the regions 'SoNord', 'UK-IE-NL' and 'AtlCan', we want to use the sea-salt corrected data ('X' suffix). Elsewhere, we will use uncorrected values.
In [8]:
# Regions to correct
sscor_regs = ['SoNord', 'UK-IE-NL', 'AtlCan']
# Join regions
wc_df = pd.merge(wc_df,
stn_df[['station_id', 'region']],
how='left',
on='station_id')
# Split dataset
sscor_df = wc_df.query("region in @sscor_regs").copy()
nocor_df = wc_df.query("region not in @sscor_regs").copy()
# Get relevant cols for each dataset, then remove the 'X' suffix from the corrected data
sscor_cols = ['TOC', 'ECl', 'EH', 'ESO4X', 'ENO3', 'ECaX_EMgX', 'ANCX', 'ALK-E']
nocor_cols = ['TOC', 'ECl', 'EH', 'ESO4', 'ENO3', 'ECa_EMg', 'ANC', 'ALK-E']
nocor_df = nocor_df[idx_cols+nocor_cols]
sscor_df = sscor_df[idx_cols+sscor_cols]
sscor_df.columns = idx_cols+nocor_cols
# Combine
wc_df = pd.concat([nocor_df, sscor_df],
axis=0,
sort=False).reset_index(drop=True)
wc_df.head()
Out[8]:
For some locations - most notably Ireland - the sea-salt correction produces negative values for some parameters. We have discussed this fairly extensively (see e.g. here and the e-mail received from Øyvind on 24.06.2019 at 13.44). Our current decision is to accept these negative values: they indicate high uncertainty, but still reflect our "best guess" on which to base the trend analyses. The code cell below is therefore commented out.
In [9]:
# TOC, EH, ESO4, ENO3 and ECa_EMg (renamed cols) should be positive
# Set values < 0 to zero
#for col in ['TOC', 'EH', 'ESO4', 'ENO3', 'ECa_EMg']:
#    wc_df.loc[wc_df[col] < 0, col] = 0
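Instead of truncating, the extent of the negative values can be quantified with a quick count per parameter. This is a small sketch using the renamed columns from above (NaN values compare as False, so only genuine negatives are counted):

# Count negative values per (renamed) parameter
for col in ['TOC', 'ECl', 'EH', 'ESO4', 'ENO3', 'ECa_EMg', 'ANC', 'ALK-E']:
    print(f"{col}: {(wc_df[col] < 0).sum()} negative values")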
In [10]:
# In the past, some German stations have reported depths > 0. Remove these
wc_df = wc_df.query("(depth1 == 0) and (depth2 == 0)")
In [11]:
# Save for speed
wc_csv = r'../../../Thematic_Trends_Report_2019/working_chem.csv'
wc_df.to_csv(wc_csv,
index=False,
encoding='utf-8')
In [12]:
# Load saved data
wc_csv = r'../../../Thematic_Trends_Report_2019/working_chem.csv'
wc_df = pd.read_csv(wc_csv, encoding='utf-8')
wc_df['sample_date'] = pd.to_datetime(wc_df['sample_date'])
In [13]:
# Which stations have duplicates?
dup_df['station_code'].unique()
Out[13]:
As described here, these duplicates are due to some countries reporting both DOC and TOC, which are not distinguished within ICPW.
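If desired, the duplicates could be summarised per station by grouping dup_df on station code. This is a minimal sketch: the parameter column name 'parameter_name' is an assumption and should be checked against dup_df.columns in nivapy's actual output.

# Sketch: list which parameters are duplicated at each station
# 'parameter_name' is an assumed column name - check dup_df.columns
dup_df.groupby('station_code')['parameter_name'].unique()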
In [14]:
# How many stations have data?
print(len(stn_df), 'stations in total.')
print(len(wc_df['station_id'].unique()), 'stations with data.')
# Identify stns missing data
print('\nStations without data:')
no_data_stns = set(stn_df['station_id']) - set(wc_df['station_id'])
stn_df.query('station_id in @no_data_stns')
Out[14]:
This makes sense: IE20 has no data for the reasons described here. The six Italian stations were only proposed in 2018, and their monitoring did not begin until 2017, so they have no data within the 1990 to 2016 period considered here.
In [15]:
# Extract year
wc_df2 = pd.merge(wc_df,
stn_df[['station_id', 'country']],
how='left',
on='station_id')
wc_df2['year'] = wc_df2['sample_date'].dt.year
# Number of samples per year per site
agg = wc_df2.groupby(['station_id', 'country', 'year']).agg({'station_code':'count'}).reset_index()
agg.columns = ['station_id', 'country', 'year', 'count']
# Plot
g = sn.catplot(data=agg,
x='year',
y='count',
row='country',
kind='bar',
ci=95,
sharey=False,
aspect=5)
for ax in g.axes:
ax[0].xaxis.set_tick_params(which='both', labelbottom=True)
plt.tight_layout()
In [16]:
# Get the min, med and max samples per year for each country
agg.groupby('country').agg({'count': ['min', 'median', 'max']})
Out[16]:
In [17]:
# Keep stations with more than 25 samples in every year with data
min_per_yr = agg.groupby('station_id').agg({'count': 'min'})
hi_freq_ids = min_per_yr.query('count > 25').index
hi_freq_df = stn_df[stn_df['station_id'].isin(hi_freq_ids)]
out_csv = r'../../../Thematic_Trends_Report_2019/hi_freq_stns.csv'
hi_freq_df.to_csv(out_csv, index=False, encoding='utf-8')
hi_freq_df
Out[17]:
In [18]:
# Melt to long format
wc_lng = wc_df2.melt(id_vars=['station_id', 'station_code',
'station_name', 'country',
'sample_date', 'year',
'depth1', 'depth2'],
var_name='param')
# Plot
g = sn.catplot(data=wc_lng,
x='country',
y='value',
row='param',
kind='box',
aspect=4,
sharey=False)
for ax in g.axes:
ax[0].xaxis.set_tick_params(which='both', labelbottom=True)
plt.tight_layout()
In [19]:
# Summary stats for df
wc_df.describe()
Out[19]:
In [20]:
# Check regions with negative values
for col in ['TOC', 'ECl', 'EH', 'ESO4', 'ENO3', 'ECa_EMg', 'ANC', 'ALK-E']:
if wc_df[col].min() < 0:
stn_ids = wc_df[wc_df[col]<0]['station_id'].unique()
print('%s is less than zero in the following regions:' % col)
print(' ', stn_df.query('station_id in @stn_ids')['region'].unique())
ANC and alkalinity can both be negative, so the above output looks OK. We only have negative ESO4 and ECa_EMg in regions where sea-salt corrections have been applied, as described above.
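The same logic can also be verified programmatically, e.g. by asserting that every station with negative ESO4 or ECa_EMg lies within one of the sea-salt corrected regions. A quick sanity check using the objects defined above:

# Sanity check: negative ESO4/ECa_EMg should only occur in sea-salt corrected regions
for col in ['ESO4', 'ECa_EMg']:
    neg_ids = wc_df.loc[wc_df[col] < 0, 'station_id'].unique()
    neg_regs = stn_df.query('station_id in @neg_ids')['region'].unique()
    assert set(neg_regs) <= set(sscor_regs), f'{col}: negatives outside corrected regions'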