In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
import mpld3
import pandas as pd
import numpy as np
import imp
from sqlalchemy import create_engine
sn.set_context('notebook')

Sweden: missing Ca data

Heleen has discovered that Ca data are missing from RESA2 for 25 Swedish sites in the TOC trends analysis - see e-mail received 07/02/2017 at 16.12. The relevant background data is gathered together in this spreadsheet:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Swedish_Ca_Data\Missing_Data_25_Swedish_Sites.xlsx

The 25 sites (with their RESA2 site codes) are listed in the resa2_stns_missing_data sheet. Salar has extracted the data for these sites from the SMHI database and I have copied this to the salar_data sheet and simplified the column headings (see e-mail from Heleen received 09/02/2017 at 10.35 for the original). The sheet from_resa_10-02-2017 shows all the Mg and Ca data from the Swedish sites currently associated with the ICPW_TOCTRENDS_2015_SE project (as of 10/02/2017, before any errors were corrected). This should make it possible to identify which water samples are already in the database.

1. Read data


In [2]:
# Read data
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\Swedish_Ca_Data\Missing_Data_25_Swedish_Sites.xlsx')

smhi_df = pd.read_excel(in_xls, sheetname='salar_data')
resa_df = pd.read_excel(in_xls, sheetname='from_resa_10-02-2017')

# Get just the cols of interest
smhi_df = smhi_df[['stn_code', 'stn_name', 'date', 'depth_m', 'ca_mekv/l', 'mg_mekv/l']]
resa_df = resa_df[['stn_code', 'stn_name', 'date', 'depth1', 'depth2', 'ca_uekv/l', 'mg_uekv/l']]

In [3]:
smhi_df.head(10)


Out[3]:
stn_code stn_name date depth_m ca_mekv/l mg_mekv/l
0 36658 Mossjön 1987-02-25 2.0 0.082 0.074
1 36658 Mossjön 1987-08-11 2.0 0.066 0.050
2 36658 Mossjön 1987-10-20 2.0 0.061 0.055
3 36658 Mossjön 1988-05-03 2.0 0.055 0.043
4 36658 Mossjön 1988-08-09 2.0 0.055 0.048
5 36658 Mossjön 1988-10-18 2.0 0.061 0.051
6 36658 Mossjön 1989-08-09 2.0 0.072 0.058
7 36658 Mossjön 1989-10-30 0.5 0.073 0.058
8 36658 Mossjön 1990-02-20 2.0 0.059 0.063
9 36658 Mossjön 1990-08-09 2.0 0.079 0.066

In [4]:
resa_df.head(10)


Out[4]:
stn_code stn_name date depth1 depth2 ca_uekv/l mg_uekv/l
0 36561 Abiskojaure 1988-06-21 0 0 182.991 44.987
1 36561 Abiskojaure 1988-08-10 0 0 158.992 36.989
2 36561 Abiskojaure 1988-09-12 0 0 184.991 42.988
3 36561 Abiskojaure 1989-03-14 0 0 342.983 78.977
4 36561 Abiskojaure 1989-06-14 0 0 248.988 61.982
5 36561 Abiskojaure 1989-08-23 0 0 164.992 37.989
6 36561 Abiskojaure 1989-10-11 0 0 202.990 48.986
7 36561 Abiskojaure 1990-04-17 0 0 323.984 86.975
8 36561 Abiskojaure 1990-08-14 0 0 138.993 33.990
9 36561 Abiskojaure 1990-09-11 0 0 153.992 37.989

2. Join data to identify common water samples

The recent data from Salar includes more water samples than are currently in RESA2. In addition, it looks as though the sample depths in RESA2 might be incorrect. As a first step, let's see if we can match water samples based on sample date and station codes.


In [5]:
# Join datasets
df = pd.merge(resa_df, smhi_df, how='inner',
              left_on=['stn_code', 'date'],
              right_on=['stn_code', 'date'])
df.head()


Out[5]:
stn_code stn_name_x date depth1 depth2 ca_uekv/l mg_uekv/l stn_name_y depth_m ca_mekv/l mg_mekv/l
0 36575 Örvattnet 1988-03-18 0 0 NaN 32.990 Örvattnet 2.0 0.059 0.033
1 36575 Örvattnet 1988-08-24 0 0 NaN 30.991 Örvattnet 2.0 0.052 0.031
2 36575 Örvattnet 1988-11-02 0 0 NaN 30.991 Örvattnet 1.5 0.054 0.031
3 36575 Örvattnet 1989-02-22 0 0 NaN 30.991 Örvattnet 2.0 0.067 0.031
4 36575 Örvattnet 1989-08-21 0 0 NaN 29.991 Örvattnet 2.0 0.055 0.030

Next. check how much of the data downloaded by Salar is already in the database.


In [6]:
print len(smhi_df)
print len(df)


2412
1910

So, the recent data downloaded by Salar includes 2412 records for these 25 stations, of which 1910 are already in RESA2. Let's check if the values for Mg are comparable.


In [7]:
# Convert Mg to ueq/l and round to nearest whole number
df['mg_old_ueq/l'] = (df['mg_uekv/l'] + 0.5).astype(int)
df['mg_new_ueq/l'] = (df['mg_mekv/l']*1000 + 0.5).astype(int)

print 'All Mg values match:', (df['mg_old_ueq/l'] == df['mg_new_ueq/l']).all()

df.head()


All Mg values match: True
Out[7]:
stn_code stn_name_x date depth1 depth2 ca_uekv/l mg_uekv/l stn_name_y depth_m ca_mekv/l mg_mekv/l mg_old_ueq/l mg_new_ueq/l
0 36575 Örvattnet 1988-03-18 0 0 NaN 32.990 Örvattnet 2.0 0.059 0.033 33 33
1 36575 Örvattnet 1988-08-24 0 0 NaN 30.991 Örvattnet 2.0 0.052 0.031 31 31
2 36575 Örvattnet 1988-11-02 0 0 NaN 30.991 Örvattnet 1.5 0.054 0.031 31 31
3 36575 Örvattnet 1989-02-22 0 0 NaN 30.991 Örvattnet 2.0 0.067 0.031 31 31
4 36575 Örvattnet 1989-08-21 0 0 NaN 29.991 Örvattnet 2.0 0.055 0.030 30 30

Excellent - the values in the RESA2 database for Mg match the values in the data Salar has recently downloaded.

However, the sample depths do not match: all the samples in RESA2 are assumed to have come from the surface (depth1=0 and depth2=0), whereas Salar's data indicates they actually come from a variety of depths between 0.5 and 2 m. In the trend analysis so far, we've focused on samples taken within the upper 0.5 m of the water column. How much does this matter? If it is important, we'll need to correct the sample depths for all the Swedish sites, which will involve Salar downloading some more data followed by a bit of additonal work from me.

3. Add Ca data

Heleen has replied to my e-mail to say not to worry about the water depths - just add the new Ca data to the existing water sample IDs (see e-mail received 10/02/2017 at 16.19).

First, connect to the database.


In [8]:
# Connect to db
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)

engine, conn = resa2.connect_to_resa2()

Next, get the sample IDs associated with these 25 sites and join them to the data above.


In [9]:
# Get all water samples for these sites
stns = tuple(df['stn_code'].unique())

sql = ('SELECT station_id, sample_date, water_sample_id '
       'FROM resa2.water_samples '
       'WHERE station_id IN %s' % str(stns))
samp_df = pd.read_sql_query(sql, engine)

# Join
df = pd.merge(df, samp_df, how='left',
              left_on=['date', 'stn_code'],
              right_on=['sample_date', 'station_id'])

df.head()


Out[9]:
stn_code stn_name_x date depth1 depth2 ca_uekv/l mg_uekv/l stn_name_y depth_m ca_mekv/l mg_mekv/l mg_old_ueq/l mg_new_ueq/l station_id sample_date water_sample_id
0 36575 Örvattnet 1988-03-18 0 0 NaN 32.990 Örvattnet 2.0 0.059 0.033 33 33 36575 1988-03-18 594956
1 36575 Örvattnet 1988-08-24 0 0 NaN 30.991 Örvattnet 2.0 0.052 0.031 31 31 36575 1988-08-24 594957
2 36575 Örvattnet 1988-11-02 0 0 NaN 30.991 Örvattnet 1.5 0.054 0.031 31 31 36575 1988-11-02 594958
3 36575 Örvattnet 1989-02-22 0 0 NaN 30.991 Örvattnet 2.0 0.067 0.031 31 31 36575 1989-02-22 594959
4 36575 Örvattnet 1989-08-21 0 0 NaN 29.991 Örvattnet 2.0 0.055 0.030 30 30 36575 1989-08-21 594960

In [10]:
# Extract columns of interest
df = df[['water_sample_id', 'ca_mekv/l']]

# Rename columns
df.columns = ['sample_id', 'value']

# Add new columns
df['method_id'] = 10551
df['approved'] = 'YES'
df['entered_by'] = 'JES'
df['entered_date'] = pd.datetime(2017, 2, 10)

# Reorder columns
df = df[['sample_id', 'method_id', 'value', 'approved', 'entered_by', 'entered_date']]

df.head(10)


Out[10]:
sample_id method_id value approved entered_by entered_date
0 594956 10551 0.059 YES JES 2017-02-10
1 594957 10551 0.052 YES JES 2017-02-10
2 594958 10551 0.054 YES JES 2017-02-10
3 594959 10551 0.067 YES JES 2017-02-10
4 594960 10551 0.055 YES JES 2017-02-10
5 594961 10551 0.058 YES JES 2017-02-10
6 594962 10551 0.047 YES JES 2017-02-10
7 594963 10551 0.060 YES JES 2017-02-10
8 594964 10551 0.059 YES JES 2017-02-10
9 594965 10551 0.060 YES JES 2017-02-10

In [13]:
# Write to database
df.to_sql(name='WATER_CHEMISTRY_VALUES2',
          schema='RESA2',
          con=engine,
          index=False,
          if_exists='append')


C:\Data\WinPython-64bit-2.7.10.3\python-2.7.10.amd64\lib\site-packages\pandas\io\sql.py:1165: UserWarning: The provided table name 'WATER_CHEMISTRY_VALUES2' is not found exactly as such in the database after writing the table, possibly due to case sensitivity issues. Consider using lower case table names.
  warnings.warn(msg, UserWarning)