In [19]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import imp
from sqlalchemy import create_engine

TOC and elevation corrections

Some further changes to the ICPW trends analysis are required:

  1. Heleen has discovered some strange results for TOC for some of the Canadian sites (see e-mail received 14/03/2017 at 17.45)

  2. We now have elevation data for the remaining sites (see e-mail received 15/03/2017 at 08.37)

  3. Heleen would like a "grid cell ID" adding to the climate processing output (see e-mail received 15/03/2017 13.33)

Having made the above changes, the whole climate data and trends analysis needs re-running. This notebook deals with points 1 and 2 above; point 3 requires a small modification to the existing climate code.

1. Correct TOC

This is a bit more complicated than it first appears. It looks as though a lot of dupicate data was uploaded to the database at some point, and some of the duplicates have incorrect method names. For the Ontairo lakes, the same values have been uploaded both as DOC (in mg-C/l) and as "DOCx", which is in umol-C/l. The conversion factor from DOCx to DOC is therefore 0.012, which is very close to Heleen's estimated correction factor of dividing by 100. The problem is that the database appears to be selecting which values to display more-or-less at random. This is illustrated below.


In [22]:
# Create db connection
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()

In [23]:
# Get example data
sql = ("SELECT * FROM resa2.water_chemistry_values2 "
       "WHERE sample_id = (SELECT water_sample_id "
                          "FROM resa2.water_samples "
                          "WHERE station_id = 23466 "
                          "AND sample_date = DATE '2000-05-23') "
       "AND method_id IN (10313, 10294)")

df = pd.read_sql_query(sql, engine)

df


Out[23]:
value_id sample_id method_id value flag1 flag2 approved remark entered_by entered_date detection_limit labware_status
0 4011483 330853 10294 290 None None YES None RESA2 2006-03-21 14:07:47 None None
1 3688111 330853 10313 290 None None YES None RESA2 2006-02-17 13:38:39 None None

method_id=10294 is DOC in mg-C/l, whereas method_id=10313 is DOCx in umol-C/l. Both were uploaded within the space of a few weeks back in 2006. I assume that the values with method_id=10313 are correct, and those with method_id=10294 are wrong.

It seems as though, when both methods are present, RESA2 preferentially chooses method_id=10313, which is why most of the data look OK. However, if method_id=10313 is not available, the database uses the values for method_id=10294 instead, and these values are wrong. The problem is that this selection isn't deliberate: the database only prefers method_id=10313 because it appears lower in the table than method_id=10294. Essentially, it's just a fluke that most of the data turn out OK - it could easily have been the other way around.

To fix this, I need to:

  1. Go through all the samples from the Ontario sites and see whether there are values for both method_id=10313 and method_id=10294

  2. If yes, see whether the raw values are the same. If so, delete the value for method_id=10294

  3. If values are only entered with method_id=10294, check to see whether they are too large and, if so, switch the method_id to 10313

This is done below.


In [33]:
# Get a list of all water samples associated with
# stations in the 'ICPW_TOCTRENDS_2015_CA_ICPW' project
sql = ("SELECT water_sample_id FROM resa2.water_samples "
       "WHERE station_id IN ( "
         "SELECT station_id FROM resa2.stations "
         "WHERE station_id IN ( "
           "SELECT station_id FROM resa2.projects_stations "
           "WHERE project_id IN ( "
             "SELECT project_id FROM resa2.projects "
             "WHERE project_name = 'ICPW_TOCTRENDS_2015_CA_ICPW')))")
     
samp_df = pd.read_sql_query(sql, engine)

# Loop over samples and check whether both method_ids are present
for samp_id in samp_df['water_sample_id'].values:
    # Get data for this sample
    sql = ("SELECT method_id, value "
           "FROM resa2.water_chemistry_values2 "
           "WHERE sample_id = %s "
           "AND method_id IN (10294, 10313)" % samp_id)
    df = pd.read_sql_query(sql, engine)
    df.index = df['method_id']
    del df['method_id']
    
    # How many entries for DOC?
    if len(df) == 1:
        # We have just one of the two methods
        if df.index[0] == 10294:
            # Should be DOC in mg-C/l and values should be <50
            if df['value'].values[0] > 50:
                # Method_ID must be wrong
                sql = ('UPDATE resa2.water_chemistry_values2 '
                       'SET method_id = 10313 '
                       'WHERE sample_id = %s '
                       'AND method_id = 10294' % samp_id)
                result = conn.execute(sql)
    
    # Otherwise we have both methods
    elif len(df) == 2:
        # Are they the same and large?
        if (df.loc[10313].value == df.loc[10294].value) and (df.loc[10313].value > 50):
            # Delete record for method_id=10294
            sql = ('DELETE FROM resa2.water_chemistry_values2 '
                   'WHERE sample_id = %s '
                   'AND method_id = 10294' % samp_id)
            result = conn.execute(sql)

print 'Finished.'


Finished.

2. Update station elevations

Heleen has provided the missing elevation data, which I copied here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data\missing_elev_data.xlsx


In [36]:
# Read elev data
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\missing_elev_data.xlsx')
elev_df = pd.read_excel(in_xlsx)
elev_df.index = elev_df['station_id']

# Loop over stations and update info
for stn_id in elev_df['station_id'].values:
    # Get elev
    elev = elev_df.loc[stn_id]['altitude']

    # Update rows
    sql = ('UPDATE resa2.stations '
           'SET altitude = %s '
           'WHERE station_id = %s' % (elev, stn_id))
    result = conn.execute(sql)


120
170
87
86
100
120
113
105
100
90
129
160
100
90
135
105
98
665
571
457
673
30
61
197
130
73
43
152
150
300
236