In [19]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import imp
from sqlalchemy import create_engine
Some further changes to the ICPW trends analysis are required:
Heleen has discovered some strange results for TOC for some of the Canadian sites (see e-mail received 14/03/2017 at 17.45)
We now have elevation data for the remaining sites (see e-mail received 15/03/2017 at 08.37)
Heleen would like a "grid cell ID" adding to the climate processing output (see e-mail received 15/03/2017 13.33)
Having made the above changes, the whole climate data and trends analysis needs re-running. This notebook deals with points 1 and 2 above; point 3 requires a small modification to the existing climate code.
This is a bit more complicated than it first appears. It looks as though a lot of dupicate data was uploaded to the database at some point, and some of the duplicates have incorrect method names. For the Ontairo lakes, the same values have been uploaded both as DOC (in mg-C/l) and as "DOCx", which is in umol-C/l. The conversion factor from DOCx to DOC is therefore 0.012, which is very close to Heleen's estimated correction factor of dividing by 100. The problem is that the database appears to be selecting which values to display more-or-less at random. This is illustrated below.
In [22]:
# Create db connection
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)
engine, conn = resa2.connect_to_resa2()
In [23]:
# Get example data
sql = ("SELECT * FROM resa2.water_chemistry_values2 "
"WHERE sample_id = (SELECT water_sample_id "
"FROM resa2.water_samples "
"WHERE station_id = 23466 "
"AND sample_date = DATE '2000-05-23') "
"AND method_id IN (10313, 10294)")
df = pd.read_sql_query(sql, engine)
df
Out[23]:
method_id=10294
is DOC in mg-C/l, whereas method_id=10313
is DOCx in umol-C/l. Both were uploaded within the space of a few weeks back in 2006. I assume that the values with method_id=10313
are correct, and those with method_id=10294
are wrong.
It seems as though, when both methods are present, RESA2 preferentially chooses method_id=10313
, which is why most of the data look OK. However, if method_id=10313
is not available, the database uses the values for method_id=10294
instead, and these values are wrong. The problem is that this selection isn't deliberate: the database only prefers method_id=10313
because it appears lower in the table than method_id=10294
. Essentially, it's just a fluke that most of the data turn out OK - it could easily have been the other way around.
To fix this, I need to:
Go through all the samples from the Ontario sites and see whether there are values for both method_id=10313
and method_id=10294
If yes, see whether the raw values are the same. If so, delete the value for method_id=10294
If values are only entered with method_id=10294
, check to see whether they are too large and, if so, switch the method_id
to 10313
This is done below.
In [33]:
# Get a list of all water samples associated with
# stations in the 'ICPW_TOCTRENDS_2015_CA_ICPW' project
sql = ("SELECT water_sample_id FROM resa2.water_samples "
"WHERE station_id IN ( "
"SELECT station_id FROM resa2.stations "
"WHERE station_id IN ( "
"SELECT station_id FROM resa2.projects_stations "
"WHERE project_id IN ( "
"SELECT project_id FROM resa2.projects "
"WHERE project_name = 'ICPW_TOCTRENDS_2015_CA_ICPW')))")
samp_df = pd.read_sql_query(sql, engine)
# Loop over samples and check whether both method_ids are present
for samp_id in samp_df['water_sample_id'].values:
# Get data for this sample
sql = ("SELECT method_id, value "
"FROM resa2.water_chemistry_values2 "
"WHERE sample_id = %s "
"AND method_id IN (10294, 10313)" % samp_id)
df = pd.read_sql_query(sql, engine)
df.index = df['method_id']
del df['method_id']
# How many entries for DOC?
if len(df) == 1:
# We have just one of the two methods
if df.index[0] == 10294:
# Should be DOC in mg-C/l and values should be <50
if df['value'].values[0] > 50:
# Method_ID must be wrong
sql = ('UPDATE resa2.water_chemistry_values2 '
'SET method_id = 10313 '
'WHERE sample_id = %s '
'AND method_id = 10294' % samp_id)
result = conn.execute(sql)
# Otherwise we have both methods
elif len(df) == 2:
# Are they the same and large?
if (df.loc[10313].value == df.loc[10294].value) and (df.loc[10313].value > 50):
# Delete record for method_id=10294
sql = ('DELETE FROM resa2.water_chemistry_values2 '
'WHERE sample_id = %s '
'AND method_id = 10294' % samp_id)
result = conn.execute(sql)
print 'Finished.'
Heleen has provided the missing elevation data, which I copied here:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data\missing_elev_data.xlsx
In [36]:
# Read elev data
in_xlsx = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
r'\CRU_Climate_Data\missing_elev_data.xlsx')
elev_df = pd.read_excel(in_xlsx)
elev_df.index = elev_df['station_id']
# Loop over stations and update info
for stn_id in elev_df['station_id'].values:
# Get elev
elev = elev_df.loc[stn_id]['altitude']
# Update rows
sql = ('UPDATE resa2.stations '
'SET altitude = %s '
'WHERE station_id = %s' % (elev, stn_id))
result = conn.execute(sql)