In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, os, glob, numpy as np, imp
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sqlalchemy import create_engine
sn.set_context('notebook')

TOC trends 2015: database clean-up (part 2)

This notebook continues the work detailed here.

It also describes my follow-up to the 2016 call for data, which has highlighted a few more issues that need correcting.

1. Distance to coast

In the previous notebook I calculated distance to coast values, but I haven't yet added them to RESA2. I begin by defining a new parameter (var_id = 319) in RESA2.STATION_PARAMETER_DEFINITIONS. Next, the distance information needs restructuring into the upload format.


In [5]:
# Read distances to coastline table
dist_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\Data\distance_to_coast.csv')
dist_df = pd.read_csv(dist_csv)

# Restructure columns to [stn_id, var_id, value, entered_by, entered_date]
dist_df['VALUE'] = dist_df['distance_m'] / 1000.
dist_df['STATION_ID'] = dist_df['station_id']
dist_df['VAR_ID'] = 319
dist_df['ENTERED_BY'] = 'JES'
dist_df['ENTERED_DATE'] = '28.06.2016'

dist_df = dist_df[['STATION_ID', 'VAR_ID', 'VALUE', 'ENTERED_BY', 'ENTERED_DATE']]

# Save to file
out_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\Data\distance_to_coast_upload.csv')
dist_df.to_csv(out_csv, index=False)

dist_df.head()


Out[5]:
STATION_ID VAR_ID VALUE ENTERED_BY ENTERED_DATE
0 37017 319 119.943635 JES 28.06.2016
1 37043 319 180.382850 JES 28.06.2016
2 37020 319 133.560750 JES 28.06.2016
3 36991 319 116.305747 JES 28.06.2016
4 37061 319 16.462381 JES 28.06.2016

The file distance_to_coast_upload.csv has now been uploaded to the database via Access.

2. Norway sites

I had a short meeting with Heleen on 28/06/2016 to discuss metadata for the Norwegian sites. There are 7 Norwegian sites in the main ICPW project, but 83 in the wider TOC trends analysis. Note the following:

  • Table 2 of Heleen's 2008 HESS paper gives basic land use proportions for four sites (Birkenes, Storgama, Langtjern and Kårvatn) that currently have incomplete data in RESA2. Translating this information into the land use classes in RESA2 gives the proportions shown in the table below. I've saved these values in heleen_hess2008_land_use.xlsx and uploaded them to the database.

    NB: I'm assuming transitional woodland is a sub-set of total forest area.

    NB2: Land use proportions are structured strangely in the database, with some of them attached directly to the stations table. This isn't ideal. In particular, note that the column named CATCHMENT_WATER_AREA actually corresponds to var_id 23 (Water area excl. lake) and not var_id 318 (Water), as might be expected.

Station name Total forest area Deciduous area Coniferous area Peat area Bare rock Transitional woodland/scrub Water
Birkenes 90 10 80 7 3 0 0
Langtjern 67 25 3 5
Kårvatn 18 2 76 4
Storgama 11 0 0 22 59 11 8
  • Heleen has suggested that Anders Finstad might have land use proportions for (some of) the 83 Norwegian sites in the trends analysis. See Heleen's e-mail from 28/06/2016 at 09:56.

  • Heleen has suggested that Espen Lund or Øyvind Garmo might have shapefiles (or land use or mean elevation data) for the 7 Norwegian "feltforskningsstasjoner" (which are the sites missing elevation data mentioned in my previous notebook). She has also suggested that the new NVE webservice might allow catchment delineation - see e-mail received 28/06/2016 at 09:40.

    The NVE webservice is an impressive system, although it's very slow. It does allow generation of shapefiles, though, including land use proportions and median (not mean) elevation statistics. A test run for Birkenes yields 98% woodland cover and a median height of 226 m. These land use proportions don't agree exactly with what's given above, but I can certainly use this for estimating median elevations for the 7 Norwegian sites, which I'll assume can also be used as means as far as the database is concerned.

    Unfortunately, one of the sites (Storgama) is so small it is below the resolution of the NVE web service. Data for the other six are summarised in nevina_catch_props.xlsx. I've used these median elevation values to update the database.


In [7]:
# Read NEVINA results table
nev_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\Data\correct_norway\nevina_shapefiles\nevina_catch_props.xlsx')
nev_df = pd.read_excel(nev_xlsx, sheetname='data')

nev_df


Out[7]:
stn_id stn_name Area Median_Elev Sea Glacier Forest Arable Bog Bare rock Urban Effective sea
0 114 Svartetjernet 0.61 425 6.56 0 67.21 0.00 0.00 22.95 0 4.99
1 12025 Øygardsbekken 2.37 412 5.91 0 2.53 0.42 0.00 68.35 0 2.71
2 12081 Kårvatn 24.31 943 3.00 0 4.98 0.00 0.86 85.44 0 0.34
3 12082 Dalelv 3.15 172 15.87 0 36.19 0.00 5.71 40.63 0 7.29
4 129 Storvatn 6.56 171 19.36 0 7.16 0.00 0.15 55.95 0 18.29
5 221 Birkenes 0.42 226 0.00 0 97.62 0.00 0.00 0.00 0 0.00

For Storgama, I can't generate an exact catchment, but I can get a slightly bigger catchment based on an outflow point a little downstream. The map below shows the approximate location of the actual sampling point (yellow), overlaid on a catchment defined from the nearest point accessible via the NEVINA system (red). The larger catchment boundary has an area of $0.85 \; km^2$, compared to $0.6 \; km^2$ upstream of the yellow dot. However, the vast majority of the additional area is located at a lower elevation than the "true" catchment I'm interested in. I can use this fact to estimate the median elevation for the catchment above the yellow dot:

  • The elevations in NEVINA are most likely derived from a DEM i.e. a regular grid with a direct correlation between catchment area and number of cells enclosed (= number of elevation values).

  • To get from the "red" catchment to the desired "yellow" catchment, I therefore need to disregard $\frac{0.25}{0.85} \approx 30\%$ of the catchment area. If I assume this is the lower 30% of the catchment (which is roughly correct looking at the contours), this is the same as throwing away 30% of the DEM cells i.e. the lower 30% of elevation values. Under these assumptions, the median elevation for the smaller (yellow) catchment should be approximately equal to the $30 + \frac{100 - 30}{2} = 65$th percentile elevation of the larger catchment.

  • I'll round this up and use the 70th percentile of the derived catchment from NEVINA to approximate the median elevation at Storgama. This gives a value of 626 m, which I've added to RESA2.
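
The percentile reasoning above can be checked numerically. This is just a sketch using synthetic elevations (NEVINA only returns summary statistics, so the DEM values below are made up for illustration):

```python
import numpy as np

# Synthetic DEM elevations (m) for the larger "red" catchment, one value
# per grid cell. These are made-up numbers purely to test the reasoning.
rng = np.random.RandomState(1)
elev = rng.uniform(500, 700, size=850)

# Fraction of the red catchment lying below the yellow sampling point
frac_discard = 0.25 / 0.85  # ~30%

# Median of the remaining (upper) cells vs the equivalent percentile of
# the full distribution: 30 + (100 - 30) / 2 = 65
pct = 100 * (frac_discard + (1 - frac_discard) / 2)
trimmed = np.sort(elev)[int(round(frac_discard * len(elev))):]
median_small = np.median(trimmed)
approx = np.percentile(elev, pct)

print(round(pct, 1))
print(abs(median_small - approx) < 2)  # the two estimates agree closely
```

With ~30% discarded the target percentile works out at about 65, supporting the choice of the 70th percentile as a slightly conservative round-up.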

Update 01/07/2016

The land use proportions used by Anders are based on CORINE and are therefore not suitable - see e-mails from Anders Finstad (28/06/2016 at 21:09), Dag Olav Hessen (29/06/2016 at 08:45) and Heleen (29/06/2016 at 09:53). It therefore seems that detailed land use proportions are not available. Heleen has suggested that I could derive them based on catchment boundaries (already delineated by Tore?) and the Norwegian 1:50k scale land use maps (are these digitised?). See e-mail from Heleen sent 29/06/2016 at 16:08. Heleen has suggested that this job is low priority for the moment, though.

3. Czech sites

After sending out the 2016 call for data for the basic ICPW sites, Jakub responded to say one of the Czech sites seemed to be incorrectly named. In following this up, it turns out data from several different sites have been merged together - see e-mails from Jakub and Vladimir e.g. 29/06/2016 at 10:58.

This is going to take a bit of tidying up. One option is to delete all the Czech data from the database and then upload it again, but I'd like to avoid this if possible. Another issue is the Czech data include multiple datasets for the same site: Lysina (CZ07) has both monthly and weekly sampling, but only the monthly dataset is part of ICPW, whereas the weekly dataset is used for the DOC trends analysis (i.e. each series should really be assigned to a different project). Unfortunately, RESA2 has not been designed to allow different sampling and analysis protocols at the same site. The best I can do without restructuring the database is to have duplicate sites (e.g. Lysina_Weekly and Lysina_Monthly), but in general this is not a good idea as it breaks the principle of database normalisation, which could lead to problems later. This is actually a fundamental weakness of the RESA2 database and it's something we should look at more carefully in the future.

For the Czech data, some manual checking shows that the monthly dataset for Lysina is actually just a subset of the weekly data, so in this case it makes sense to keep everything together as a single dataset for now. We'll need to be a little bit careful with this, though, as I get the impression Jakub doesn't want the weekly data to be available via ICPW.

For the moment, the key issues to address are:

  • The current site CZ08 should be Uhlirska (not Pluhuv Bor) and the site properties need correcting.

  • The weekly data for Pluhuv Bor need moving to a new site (CZ09?), which should be associated with the DOC trends project (but not the main ICPW project).

To help work out what data belongs where, see e-mails from Jakub (29/06/2016 at 10:58) and Vladimir (29/06/2016 at 14:20).

Begin by making Pluhuv Bor site CZ09 and changing CZ08 back to Uhlirska. The site properties for Uhlirska are available here:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Tilsendte data fra Focalsentere\CzechRepublic\Innsendt2008\CZ_Innsendt 21 mars 08_Uhlirska_CZ_ICP_011193_061106.xls

The next step is to try to separate the Pluhuv Bor data from the Uhlirska data. This could be difficult, but to begin with I can try to do it by matching dates. The file

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Data\correct_czech\pluhuv_bor_wrong_code.xls

contains the weekly data for Pluhuv Bor, incorrectly labelled as site CZ08. If I can select just these records from the database and change the site code to CZ09, this should solve the problem. Note that this will only work as long as the sampling dates between Uhlirska and Pluhuv Bor are always different (i.e. they've never been sampled on the same day). To check this, I've copied all the dates I can find for the real Uhlirska into a new Excel file called uhlirska_sampling_dates.xlsx. These can then be compared to the dates in the file above for Pluhuv Bor.


In [8]:
# Read data
plu_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\Data\correct_czech\pluhuv_bor_wrong_code.xls')

uhl_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\Data\correct_czech\uhlirska_sampling_dates.xlsx')

plu_df = pd.read_excel(plu_path, sheetname='Data')
uhl_df = pd.read_excel(uhl_path, sheetname='Sheet1')

# Check for duplicated dates
print uhl_df.duplicated('Date').sum()
print plu_df.duplicated('Date').sum()

# Get the intersection
set(uhl_df['Date']).intersection(set(plu_df['Date']))


0
39
Out[8]:
{Timestamp('1994-03-03 00:00:00'),
 Timestamp('1994-03-31 00:00:00'),
 Timestamp('1994-11-30 00:00:00'),
 Timestamp('1996-05-02 00:00:00'),
 Timestamp('1996-11-04 00:00:00'),
 Timestamp('1997-06-05 00:00:00'),
 Timestamp('1997-07-31 00:00:00'),
 Timestamp('1998-09-03 00:00:00'),
 Timestamp('1999-02-04 00:00:00'),
 Timestamp('1999-03-03 00:00:00'),
 Timestamp('2000-03-02 00:00:00'),
 Timestamp('2000-06-01 00:00:00'),
 Timestamp('2001-04-05 00:00:00'),
 Timestamp('2002-01-03 00:00:00'),
 Timestamp('2002-02-28 00:00:00'),
 Timestamp('2002-05-30 00:00:00'),
 Timestamp('2003-01-03 00:00:00'),
 Timestamp('2003-02-06 00:00:00'),
 Timestamp('2003-03-06 00:00:00'),
 Timestamp('2003-05-05 00:00:00'),
 Timestamp('2003-10-06 00:00:00'),
 Timestamp('2004-03-03 00:00:00'),
 Timestamp('2004-06-02 00:00:00'),
 Timestamp('2005-02-03 00:00:00'),
 Timestamp('2005-03-31 00:00:00'),
 Timestamp('2005-06-02 00:00:00'),
 Timestamp('2005-07-01 00:00:00'),
 Timestamp('2005-07-28 00:00:00'),
 Timestamp('2005-09-01 00:00:00'),
 Timestamp('2006-07-27 00:00:00'),
 Timestamp('2007-01-31 00:00:00'),
 Timestamp('2007-08-01 00:00:00'),
 Timestamp('2007-10-01 00:00:00'),
 Timestamp('2008-08-14 00:00:00'),
 Timestamp('2009-03-01 00:00:00'),
 Timestamp('2009-10-01 00:00:00'),
 Timestamp('2010-02-01 00:00:00'),
 Timestamp('2010-11-01 00:00:00'),
 Timestamp('2011-06-01 00:00:00'),
 Timestamp('2011-08-04 00:00:00'),
 Timestamp('2011-11-01 00:00:00'),
 Timestamp('2012-10-04 00:00:00')}

So - unfortunately - there are matching dates between the two sites, and using date alone isn't going to work. The only other option I can think of (other than starting again) is to upload the Pluhuv Bor data from scratch using the code CZ09, and then search through the data for CZ08, removing any records that are the same. Tore's tidied data for Pluhuv Bor (incorrectly labelled as CZ08) are here:

K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Tilsendte data fra Focalsentere\CzechRepublic\Innsendt2015\ICPCZ90-12ALLed.xls

I've copied this to a new file called pluhuv_bor_weekly.xlsx, removed the data for the other sites and then changed the site code to CZ09. Unfortunately, my attempts to upload this data to RESA2 via Access have failed. I'm not sure what the problem is, but the Access connection keeps timing out. To get around this, I've written my own basic code for uploading ICPW templates, which is here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\upload_icpw_template.ipynb

I have used this code to upload the (hopefully correct) data for Pluhuv Bor as site CZ09. The next step is to try to identify the records in the Uhlirska series that actually belong to Pluhuv Bor. I can't do this purely based upon date, but hopefully I can using date and method and value.


In [2]:
# Create and test a db connection

# Use custom RESA2 function to connect to db
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()

# Test SQL statement
sql = ('SELECT project_id, project_number, project_name '
       'FROM resa2.projects')
df = pd.read_sql_query(sql, engine)

df.head(10)


Out[2]:
project_id project_number project_name
0 1130 None FJORDA
1 2945 O 29010 06 Otra sidebekker
2 2965 O 29264 02 VRD-Telemark-Børsesjø
3 1133 None FLAGSTAD
4 1135 None FLORID
5 1140 None FROMLAB
6 1141 None FROYSET
7 1142 None GAULA
8 1143 None GJELLESOGN
9 1144 None GPVEST

In [19]:
# Get all water chem results currently in db for Uhlirska and Pluhuv Bor

# Uhlirska
sql = ("SELECT * FROM RESA2.WATER_CHEMISTRY_VALUES2 "
       "WHERE SAMPLE_ID IN (SELECT WATER_SAMPLE_ID "
                           "FROM RESA2.WATER_SAMPLES "
                           "WHERE STATION_ID = 37745)")
uhl_df = pd.read_sql_query(sql, engine)

# Pluhuv Bor
sql = ("SELECT * FROM RESA2.WATER_CHEMISTRY_VALUES2 "
       "WHERE SAMPLE_ID IN (SELECT WATER_SAMPLE_ID "
                           "FROM RESA2.WATER_SAMPLES "
                           "WHERE STATION_ID = 33326)")
plu_df = pd.read_sql_query(sql, engine)

# Get sample dates
sql = ("SELECT WATER_SAMPLE_ID, SAMPLE_DATE "
       "FROM RESA2.WATER_SAMPLES "
       "WHERE STATION_ID IN (33326, 37745)")
sam_df = pd.read_sql_query(sql, engine)

# Join dates to each site
uhl_df = pd.merge(uhl_df, sam_df, how='left',
                  left_on='sample_id', right_on='water_sample_id')
uhl_df = uhl_df[['value_id', 'sample_date', 'sample_id', 
                 'method_id', 'value', 'flag1']]

plu_df = pd.merge(plu_df, sam_df, how='left',
                  left_on='sample_id', right_on='water_sample_id')
plu_df = plu_df[['value_id', 'sample_date', 'sample_id', 
                 'method_id', 'value', 'flag1']]

print 'Number of records for Pluhuv Bor: %s.' % len(plu_df)
print 'Number of records for Uhlirska:   %s.' % len(uhl_df)

plu_df.head()


Number of records for Pluhuv Bor: 14756.
Number of records for Uhlirska:   18090.
Out[19]:
value_id sample_date sample_id method_id value flag1
0 7483966 1991-06-06 651647 10249 20.00 None
1 7473224 1991-06-06 651647 10251 4.24 None
2 7481195 1991-06-06 651647 10253 1.48 None
3 7483161 1991-06-06 651647 10256 60.00 None
4 7477083 1991-06-06 651647 10258 0.46 None

We can now work through all the entries in plu_df, testing to see if the dates, methods, values and flags match those in uhl_df. If they do, it's likely this is a duplicate that should be deleted from the Uhlirska series. Looking at the length of each series shown above, we might expect to find $(18090 - 14756) = 3334$ duplicates.


In [31]:
# Round the value column to 2 decimal places
uhl_df['value2'] = uhl_df['value'].round(2)
plu_df['value2'] = plu_df['value'].round(2)

# Join
join_df = pd.merge(uhl_df, plu_df, how='left',
                   left_on=['sample_date', 'method_id', 'value2', 'flag1'],
                   right_on=['sample_date', 'method_id', 'value2', 'flag1'])
print len(join_df)

print pd.notnull(join_df['value_y']).sum()
join_df.head(10)


18090
13381
Out[31]:
value_id_x sample_date sample_id_x method_id value_x flag1 value2 value_id_y sample_id_y value_y
0 5653803 1991-06-06 574942 10249 20.00 None 20.00 7483966.0 651647.0 20.00
1 5653804 1991-06-06 574942 10251 4.24 None 4.24 7473224.0 651647.0 4.24
2 5653805 1991-06-06 574942 10253 1.48 None 1.48 7481195.0 651647.0 1.48
3 5653806 1991-06-06 574942 10256 60.00 None 60.00 7483161.0 651647.0 60.00
4 5653807 1991-06-06 574942 10258 0.46 None 0.46 7477083.0 651647.0 0.46
5 5653808 1991-06-06 574942 10260 22.90 None 22.90 7471694.0 651647.0 22.90
6 5653809 1991-06-06 574942 10261 25.60 None 25.60 7472422.0 651647.0 25.60
7 5653810 1991-06-06 574942 10263 1.55 None 1.55 7475982.0 651647.0 1.55
8 5653811 1991-06-06 574942 10264 15.00 < 15.00 7474269.0 651647.0 15.00
9 5653812 1991-06-06 574942 10265 33.90 None 33.90 NaN NaN NaN

Unfortunately, rounding differences in the way the data have been uploaded previously mean that I am unable to match based on values (see record 9 in the dataframe above), and reliable value matching is crucial for this method to work. I think the only option is therefore to upload all the data for Uhlirska again.
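
For completeness, tolerance-based matching (rather than exact equality) is one way to make value comparisons more forgiving. A minimal sketch with illustrative numbers, not the real data:

```python
import numpy as np
import pandas as pd

# Illustrative only: the same measurement stored with slightly different
# rounding in the two series
uhl = pd.DataFrame({'method_id': [10249, 10265], 'value': [20.0, 33.9]})
plu = pd.DataFrame({'method_id': [10249, 10265], 'value': [20.0, 33.904]})

merged = uhl.merge(plu, on='method_id', suffixes=('_uhl', '_plu'))

# Exact equality misses the second pair; a small absolute tolerance keeps it
merged['exact'] = merged['value_uhl'] == merged['value_plu']
merged['close'] = np.isclose(merged['value_uhl'], merged['value_plu'],
                             atol=0.01)

print(merged[['exact', 'close']])
```

Even so, tuning a tolerance risks both false matches and misses, so re-uploading the Uhlirska data remains the safer fix.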

I've copied all the monthly Uhlirska data to a new Excel file here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Data\correct_czech\uhlirska_monthly.xlsx

I can now drop all the samples associated with this site from the database and upload the data again. Let's hope this works! The following SQL deletes the records. Note that care needs to be taken with some of the constraints.

First delete from RESA2.WATER_CHEMISTRY_VALUES2:

DELETE FROM RESA2.WATER_CHEMISTRY_VALUES2
WHERE sample_id IN (SELECT water_sample_id
                    FROM RESA2.WATER_SAMPLES
                    WHERE STATION_ID = 37745);

There is also a table called RESA2.SAMPLE_SELECTIONS that links water samples to projects. The Czech ICPW project has PROJECT_ID=2986. I'm not sure whether these need to be added again afterwards - as far as I can tell, the upload procedure for ICPW doesn't add records to this table. Bear this in mind though: if you run into problems later, you need to add records to this table for the samples from sites CZ08 and CZ09. If you do this, the SAMPLE_SELECTION_ID should equal 52, which corresponds to PROJECT_ID 2986 in the SAMPLE_SELECTION_DEFINITIONS table.

DELETE FROM RESA2.SAMPLE_SELECTIONS
WHERE water_sample_id IN (SELECT water_sample_id
                          FROM RESA2.WATER_SAMPLES
                          WHERE STATION_ID = 37745);

Finally, we can delete from the WATER_SAMPLES table:

DELETE FROM RESA2.WATER_SAMPLES
WHERE STATION_ID = 37745;

Having committed these changes, I can now upload the Uhlirska data again using the notebook and spreadsheet linked above.

I have also uploaded new data for sites CZ01 to CZ07, as given in the spreadsheet sent by Vladimir on 29/06/2016 at 14:20. Samples for these 7 sites that are not already included in the database have been copied here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016\Replies\czech_republic\ICPDataCZ2016.xls

and then uploaded.

Finally, I have moved site CZ09 (Pluhuv Bor) from the ICPWaters CZ project to the ICPW_TOCTRENDS_2015_CZ project, as requested by Jakub.

3.2. Check silica values

One final issue with the Czech sites concerns the values reported for silica. Due to an ambiguity in the input template, the Czechs have not always reported their units correctly - see e-mail from Vladimir received 01/07/2016 at 11:06 for details.

The overall conclusion is that the units will need changing for particular silica samples. First, let's plot the data to see if there are any obvious issues, other than those already identified by Vladimir. I've already written some convenience plotting functions for RESA2, which can hopefully be used here.


In [12]:
# Specify sites and parameters of interest
stn_list = ['CZ01', 'CZ02', 'CZ03', 'CZ04', 'CZ05', 'CZ06', 'CZ07', 'CZ08', 'CZ09']
par_list = ['SiO2',]

# Period of interest
st_dt = '1980-01-01'   # yyyy-mm-dd
end_dt = '2015-12-31'  # yyyy-mm-dd

# Create plots
resa2.plot_resa2_ts(stn_list, par_list, st_dt, end_dt)


Out[12]:

It seems pretty clear from the above plots that something strange is happening with silica in the latter part of the records. Most of the lake sites (CZ01 to CZ06) show a dip in 2007 that could suggest $Si$ being reported as $SiO_2$. These same lakes show a distinctive peak from around 2010 onwards (but with a dip in 2013), which could be $SiO_2$ reported as $Si$.

Vladimir's e-mail focuses on the data collected since 02/08/2010. He says all the silica values have actually been reported as $SiO_2$ (not $Si$ as stated in his spreadsheet), except for the measurements during 2013 which are actually $Si$. This probably explains the anomalous pattern observed above.
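
For scale, the stoichiometric difference between the two conventions is roughly a factor of two, which matches the size of the anomalies in the plots. A quick check (standard molar masses, not values taken from RESA2):

```python
# Standard molar masses in g/mol (not taken from RESA2)
M_SI = 28.086
M_SIO2 = 60.084

# A concentration measured as SiO2 but reported as Si overstates the true
# Si concentration by this factor (and vice versa understates it)
si_to_sio2 = M_SIO2 / M_SI
print(round(si_to_sio2, 2))  # ~2.14

# e.g. 1.5 mg-Si/l expressed as SiO2:
print(round(1.5 * si_to_sio2, 2))
```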

To correct these problems, I need to change the method_id for all the silica measurements from sites CZ01 to CZ06 inclusive for the period since 02/08/2010 (but excluding those collected during 2013). The method_id needs changing from 10289 (for $Si$) to 10270 (for $SiO_2$). This is done with the following SQL:

UPDATE RESA2.WATER_CHEMISTRY_VALUES2
SET METHOD_ID = 10270
WHERE SAMPLE_ID IN (SELECT WATER_SAMPLE_ID FROM RESA2.WATER_SAMPLES
                    WHERE STATION_ID IN (SELECT STATION_ID FROM RESA2.STATIONS
                                         WHERE STATION_CODE IN ('CZ01', 'CZ02', 'CZ03', 'CZ04', 'CZ05', 'CZ06'))
                    AND RESA2.WATER_SAMPLES.SAMPLE_DATE >= DATE '2010-08-02'
                    AND EXTRACT(YEAR FROM SAMPLE_DATE) <> 2013)
AND RESA2.WATER_CHEMISTRY_VALUES2.METHOD_ID = 10289;

Re-running the code above, the plots now look like this:


In [14]:
# Create plots
resa2.plot_resa2_ts(stn_list, par_list, st_dt, end_dt)


Out[14]:

This is clearly an improvement, but I'm still suspicious about the values in 2007 and early in 2010. Follow this up with Vladimir.

Finally, Vladimir also suspects the 2014 data for CZ08 have been mistakenly reported as $Si$ when they're actually $SiO_2$. This looks likely, but I also suspect some of the 2013 values have been misreported as well. Check with Vladimir.

I have summarised my work so far on the Czech sites in an e-mail to Jakub sent 12/07/2016 at 12:01.

4. Swedish sites

Salar has shown me two online databases with information for the Swedish sites: VISS and AROS. Unfortunately, the land use data in AROS appears to be based on CORINE, and in the meeting on 27/05/2016 it was decided that this wasn't really good enough for the work being proposed.

The land use information in VISS is more useful but, sadly, it's only available for a handful of the sites. See e-mail sent to Jens 12/07/2016 at 12:48 for latest details.

5. Newfoundland sites

Prior to 1995, the DOC analysis for the Newfoundland water samples used a persulphate digestion technique that has since been shown to significantly underestimate actual DOC concentrations (see e-mail from Heleen received 23/05/2016 at 14:01 and this paper for details).

To allow for this, data for the 12 Newfoundland sites (associated with project ICPW_TOCTRENDS_2015_CA_NF) need adjusting by applying a correction factor of 1.28 to all DOC values measured prior to 01/01/1995.

To do this I've created a new method (method_id = 10823) for the old persulphate digestion technique and have set conversion factors of 1.28 in RESA2.WC_PARAMETERS_METHODS. I've then used the following UPDATE query to assign DOC measurements for the Newfoundland sites collected prior to 1995 to this new method:

UPDATE resa2.water_chemistry_values2
SET method_id = 10823
WHERE sample_id IN (SELECT water_sample_id FROM resa2.water_samples
                    WHERE station_id IN (SELECT station_id FROM resa2.projects_stations
                                         WHERE project_id = 4050)
                    AND sample_date < DATE '1995-01-01')
AND method_id = 10294;
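
The effect of the method change is that RESA2 applies the 1.28 conversion factor when the values are reported. As a sanity check, the correction amounts to the following (hypothetical DOC values, not real Newfoundland data):

```python
import pandas as pd

# Hypothetical DOC series (mg/l) -- the real correction is applied by
# RESA2 via the conversion factor on method 10823, not in pandas
doc = pd.DataFrame({
    'sample_date': pd.to_datetime(['1993-06-01', '1994-06-01', '1996-06-01']),
    'doc': [3.0, 3.5, 4.0],
})

# Scale pre-1995 persulphate values by 1.28; later samples are untouched
pre95 = doc['sample_date'] < '1995-01-01'
doc.loc[pre95, 'doc'] *= 1.28

print([round(v, 2) for v in doc['doc']])  # [3.84, 4.48, 4.0]
```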

Some of the sites in Nova Scotia have also used this method in the past, but Salar has previously corrected these DOC values using a slightly different workflow. I have checked that Salar's changes have been finalised in the database itself, but I haven't done anything further regarding changing method_ids etc., because I don't want to end up applying the correction factor twice.