In [19]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import imp
from sqlalchemy import create_engine

TOC and elevation corrections

Some further changes to the ICPW trends analysis are required:

Heleen has discovered some strange results for TOC for some of the Canadian sites (see e-mail received 14/03/2017 at 17.45)
We now have elevation data for the remaining sites (see e-mail received 15/03/2017 at 08.37)
Heleen would like a "grid cell ID" adding to the climate processing output (see e-mail received 15/03/2017 13.33)

Having made the above changes, the whole climate data and trends analysis needs re-running. This notebook deals with points 1 and 2 above; point 3 requires a small modification to the existing climate code.

1. Correct TOC

This is a bit more complicated than it first appears. It looks as though a lot of dupicate data was uploaded to the database at some point, and some of the duplicates have incorrect method names. For the Ontairo lakes, the same values have been uploaded both as DOC (in mg-C/l) and as "DOCx", which is in umol-C/l. The conversion factor from DOCx to DOC is therefore 0.012, which is very close to Heleen's estimated correction factor of dividing by 100. The problem is that the database appears to be selecting which values to display more-or-less at random. This is illustrated below.



In [22]:

    
# Create db connection
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()



In [23]:

    
# Get example data
sql = ("SELECT * FROM resa2.water_chemistry_values2 "
       "WHERE sample_id = (SELECT water_sample_id "
                          "FROM resa2.water_samples "
                          "WHERE station_id = 23466 "
                          "AND sample_date = DATE '2000-05-23') "
       "AND method_id IN (10313, 10294)")

df = pd.read_sql_query(sql, engine)

df









    Out[23]:






  
    
      
      value_id
      sample_id
      method_id
      value
      flag1
      flag2
      approved
      remark
      entered_by
      entered_date
      detection_limit
      labware_status
    
  
  
    
      0
      4011483
      330853
      10294
      290
      None
      None
      YES
      None
      RESA2
      2006-03-21 14:07:47
      None
      None
    
    
      1
      3688111
      330853
      10313
      290
      None
      None
      YES
      None
      RESA2
      2006-02-17 13:38:39
      None
      None

method_id=10294 is DOC in mg-C/l, whereas method_id=10313 is DOCx in umol-C/l. Both were uploaded within the space of a few weeks back in 2006. I assume that the values with method_id=10313 are correct, and those with method_id=10294 are wrong.

It seems as though, when both methods are present, RESA2 preferentially chooses method_id=10313, which is why most of the data look OK. However, if method_id=10313 is not available, the database uses the values for method_id=10294 instead, and these values are wrong. The problem is that this selection isn't deliberate: the database only prefers method_id=10313 because it appears lower in the table than method_id=10294. Essentially, it's just a fluke that most of the data turn out OK - it could easily have been the other way around.

To fix this, I need to:

Go through all the samples from the Ontario sites and see whether there are values for both method_id=10313 and method_id=10294
If yes, see whether the raw values are the same. If so, delete the value for method_id=10294
If values are only entered with method_id=10294, check to see whether they are too large and, if so, switch the method_id to 10313

This is done below.



In [33]:

    
# Get a list of all water samples associated with
# stations in the 'ICPW_TOCTRENDS_2015_CA_ICPW' project
sql = ("SELECT water_sample_id FROM resa2.water_samples "
       "WHERE station_id IN ( "
         "SELECT station_id FROM resa2.stations "
         "WHERE station_id IN ( "
           "SELECT station_id FROM resa2.projects_stations "
           "WHERE project_id IN ( "
             "SELECT project_id FROM resa2.projects "
             "WHERE project_name = 'ICPW_TOCTRENDS_2015_CA_ICPW')))")
     
samp_df = pd.read_sql_query(sql, engine)

# Loop over samples and check whether both method_ids are present
for samp_id in samp_df['water_sample_id'].values:
    # Get data for this sample
    sql = ("SELECT method_id, value "
           "FROM resa2.water_chemistry_values2 "
           "WHERE sample_id = %s "
           "AND method_id IN (10294, 10313)" % samp_id)
    df = pd.read_sql_query(sql, engine)
    df.index = df['method_id']
    del df['method_id']
    
    # How many entries for DOC?
    if len(df) == 1:
        # We have just one of the two methods
        if df.index[0] == 10294:
            # Should be DOC in mg-C/l and values should be <50
            if df['value'].values[0] > 50:
                # Method_ID must be wrong
                sql = ('UPDATE resa2.water_chemistry_values2 '
                       'SET method_id = 10313 '
                       'WHERE sample_id = %s '
                       'AND method_id = 10294' % samp_id)
                result = conn.execute(sql)
    
    # Otherwise we have both methods
    elif len(df) == 2:
        # Are they the same and large?
        if (df.loc[10313].value == df.loc[10294].value) and (df.loc[10313].value > 50):
            # Delete record for method_id=10294
            sql = ('DELETE FROM resa2.water_chemistry_values2 '
                   'WHERE sample_id = %s '
                   'AND method_id = 10294' % samp_id)
            result = conn.execute(sql)

print 'Finished.'









    



Finished.

2. Update station elevations

Heleen has provided the missing elevation data, which I copied here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data\missing_elev_data.xlsx



In [36]:

    
# Read elev data
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\missing_elev_data.xlsx')
elev_df = pd.read_excel(in_xlsx)
elev_df.index = elev_df['station_id']

# Loop over stations and update info
for stn_id in elev_df['station_id'].values:
    # Get elev
    elev = elev_df.loc[stn_id]['altitude']

    # Update rows
    sql = ('UPDATE resa2.stations '
           'SET altitude = %s '
           'WHERE station_id = %s' % (elev, stn_id))
    result = conn.execute(sql)

	value_id	sample_id	method_id	value	flag1	flag2	approved	remark	entered_by	entered_date	detection_limit	labware_status
0	4011483	330853	10294	290	None	None	YES	None	RESA2	2006-03-21 14:07:47	None	None
1	3688111	330853	10313	290	None	None	YES	None	RESA2	2006-02-17 13:38:39	None	None