In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
import mpld3
import pandas as pd
import numpy as np
import imp
from sqlalchemy import create_engine
sn.set_context('notebook')
Heleen has discovered that Ca data are missing from RESA2 for 25 Swedish sites in the TOC trends analysis - see e-mail received 07/02/2017 at 16.12. The relevant background data is gathered together in this spreadsheet:
C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Swedish_Ca_Data\Missing_Data_25_Swedish_Sites.xlsx
The 25 sites (with their RESA2 site codes) are listed in the resa2_stns_missing_data sheet. Salar has extracted the data for these sites from the SMHI database and I have copied this to the salar_data sheet and simplified the column headings (see e-mail from Heleen received 09/02/2017 at 10.35 for the original). The sheet from_resa_10-02-2017 shows all the Mg and Ca data from the Swedish sites currently associated with the ICPW_TOCTRENDS_2015_SE
project (as of 10/02/2017, before any errors were corrected). This should make it possible to identify which water samples are already in the database.
In [2]:
# Read data
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
r'\Swedish_Ca_Data\Missing_Data_25_Swedish_Sites.xlsx')
smhi_df = pd.read_excel(in_xls, sheetname='salar_data')
resa_df = pd.read_excel(in_xls, sheetname='from_resa_10-02-2017')
# Get just the cols of interest
smhi_df = smhi_df[['stn_code', 'stn_name', 'date', 'depth_m', 'ca_mekv/l', 'mg_mekv/l']]
resa_df = resa_df[['stn_code', 'stn_name', 'date', 'depth1', 'depth2', 'ca_uekv/l', 'mg_uekv/l']]
In [3]:
smhi_df.head(10)
Out[3]:
In [4]:
resa_df.head(10)
Out[4]:
The recent data from Salar includes more water samples than are currently in RESA2. In addition, it looks as though the sample depths in RESA2 might be incorrect. As a first step, let's see if we can match water samples based on sample date and station codes.
In [5]:
# Join datasets
df = pd.merge(resa_df, smhi_df, how='inner',
left_on=['stn_code', 'date'],
right_on=['stn_code', 'date'])
df.head()
Out[5]:
Next. check how much of the data downloaded by Salar is already in the database.
In [6]:
print len(smhi_df)
print len(df)
So, the recent data downloaded by Salar includes 2412 records for these 25 stations, of which 1910 are already in RESA2. Let's check if the values for Mg are comparable.
In [7]:
# Convert Mg to ueq/l and round to nearest whole number
df['mg_old_ueq/l'] = (df['mg_uekv/l'] + 0.5).astype(int)
df['mg_new_ueq/l'] = (df['mg_mekv/l']*1000 + 0.5).astype(int)
print 'All Mg values match:', (df['mg_old_ueq/l'] == df['mg_new_ueq/l']).all()
df.head()
Out[7]:
Excellent - the values in the RESA2 database for Mg match the values in the data Salar has recently downloaded.
However, the sample depths do not match: all the samples in RESA2 are assumed to have come from the surface (depth1=0
and depth2=0
), whereas Salar's data indicates they actually come from a variety of depths between 0.5 and 2 m. In the trend analysis so far, we've focused on samples taken within the upper 0.5 m of the water column. How much does this matter? If it is important, we'll need to correct the sample depths for all the Swedish sites, which will involve Salar downloading some more data followed by a bit of additonal work from me.
Heleen has replied to my e-mail to say not to worry about the water depths - just add the new Ca data to the existing water sample IDs (see e-mail received 10/02/2017 at 16.19).
First, connect to the database.
In [8]:
# Connect to db
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)
engine, conn = resa2.connect_to_resa2()
Next, get the sample IDs associated with these 25 sites and join them to the data above.
In [9]:
# Get all water samples for these sites
stns = tuple(df['stn_code'].unique())
sql = ('SELECT station_id, sample_date, water_sample_id '
'FROM resa2.water_samples '
'WHERE station_id IN %s' % str(stns))
samp_df = pd.read_sql_query(sql, engine)
# Join
df = pd.merge(df, samp_df, how='left',
left_on=['date', 'stn_code'],
right_on=['sample_date', 'station_id'])
df.head()
Out[9]:
In [10]:
# Extract columns of interest
df = df[['water_sample_id', 'ca_mekv/l']]
# Rename columns
df.columns = ['sample_id', 'value']
# Add new columns
df['method_id'] = 10551
df['approved'] = 'YES'
df['entered_by'] = 'JES'
df['entered_date'] = pd.datetime(2017, 2, 10)
# Reorder columns
df = df[['sample_id', 'method_id', 'value', 'approved', 'entered_by', 'entered_date']]
df.head(10)
Out[10]:
In [13]:
# Write to database
df.to_sql(name='WATER_CHEMISTRY_VALUES2',
schema='RESA2',
con=engine,
index=False,
if_exists='append')