In [29]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')

TOC trends 2015: database clean-up (part 3)

Back from holiday and trying to remember where I'd got to with the ICPW database. John has sent detailed replies to my questions regarding the US sites (see e-mails received 25/08/2016 at 01:45 and 03:33), so this seems like a good place to start.

1. Checking US sites

My initial clean-up of RESA2 ended with 72 sites linked to the ICPW_TOCTRENDS_2015_US_LTM project (see section 2 of this notebook for details) and, following advice from Tore, I also shifted 19 US sites into a new project called ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED. In addition, manually checking RESA2 identifies 95 sites associated with the ICPWaters US project.

Based on the information in John's emails, some of this needs revisiting.

1.1. Connect to db and check site numbers are as expected


In [2]:
# Import custom functions and connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')

resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

engine, conn = resa2_basic.connect_to_resa2()
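
(connect_to_resa2 itself isn't shown in this notebook. Presumably it wraps something like the sketch below; the dialect, host and credentials here are invented for illustration only.)

# Hypothetical sketch of what connect_to_resa2() might do - the
# connection string below is made up; only the general pattern matters
def connect_to_resa2_sketch(user='resa2', pwd='********'):
    conn_str = 'oracle+cx_oracle://%s:%s@nivabase' % (user, pwd)
    engine = create_engine(conn_str)
    conn = engine.connect()
    return engine, conn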

In [4]:
# Check numbers of sites
proj_codes = [1679,   # ICPWaters US
              3870,   # ICPW_TOCTRENDS_2015_US_LTM
              4150]   # ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED

for code in proj_codes:
    # Read stations list
    sql = ('SELECT * '
           'FROM resa2.projects_stations '
           'WHERE project_id = %s' % code)
    df = pd.read_sql_query(sql, engine)
    
    print code, len(df)


1679 95
3870 72
4150 19

1.2. Check sites against John's spreadsheet

Based on John's e-mail, four of the sites that I previously excluded should actually be included in the trends project, so I need to move these back across:

Site ID Site name
040874O Brook Trout Lake, Adirondacks
1A1-109O Moss Lake, Adirondacks
1A1-110O Lake Rondaxe, Adirondacks
ME-4474 Duck Pond, Maine

John has sent a spreadsheet (U.S.Site.Reconciliation.August.2016.xlsx) listing which sites should be included in which projects. First, compare the number of sites in John's spreadsheet with the numbers in RESA2.


In [3]:
# Read John's spreadsheet
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
          r'\Replies\usa\U.S.Site.Reconciliation.August.2016.xlsx')

j_df = pd.read_excel(in_xls, sheetname='Sheet1')

print 'John, core ICPW sites: ', len(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'])
print 'John, DOC trends sites:', len(j_df[j_df['INCLUDE IN ICP DOC ANALYSIS']=='YES'])


John, core ICPW sites:  95
John, DOC trends sites: 76

This looks promising. With a bit of luck, I can just move the four sites listed above from the EXCLUDED project back into ICPW_TOCTRENDS_2015_US_LTM and everything will be OK...

...actually, no - it's not that simple. Of the 95 ICPWaters US sites in the spreadsheet, John only has RESA2 codes for 91 of them (four are marked NA). This implies the 95 sites currently in RESA2 are not the same as the 95 in the spreadsheet, so I need to work out where the discrepancies are.


In [28]:
# Get the 95 'ICPWaters US' sites from RESA2
sql = ('SELECT * FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
                            'FROM resa2.projects_stations '
                            'WHERE project_id = 1679)')
core_df = pd.read_sql_query(sql, engine)

# Outer join based on RESA2 station code
df = core_df.merge(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'], 
                   how='outer', 
                   left_on='station_code',
                   right_on='Station Code')

# Cols of interest
cols = ['station_id', 'station_code', 'station_name', 'latitude', 'longitude',
        'Station Code', 'NFC_SITEID', 'NFC_SITENAME', 'Latitude', 'Longitude']
df = df[cols]

# Get entries where no match was found
df[pd.isnull(df['station_code']) | pd.isnull(df['Station Code'])]


Out[28]:
station_id station_code station_name latitude longitude Station Code NFC_SITEID NFC_SITENAME Latitude Longitude
60 23647.0 US126 Little, Winhall, Vermont NaN NaN NaN NaN NaN NaN NaN
69 23643.0 US122 Cow Mountain, Vermont NaN NaN NaN NaN NaN NaN NaN
70 23645.0 US124 Kettle, Vermont NaN NaN NaN NaN NaN NaN NaN
71 23620.0 1C3-075E South Marlboro, Vermont 42.8439 -72.7125 NaN NaN NaN NaN NaN
95 NaN NaN NaN NaN NaN NaN 1A1-029O MIDDLE POND 44.3369 -74.3719
96 NaN NaN NaN NaN NaN NaN 1A2-076O BARNES LAKE 43.5667 -75.2278
97 NaN NaN NaN NaN NaN NaN 1C1-106E SUCKER 42.8250 -73.1292
98 NaN NaN NaN NaN NaN NaN ME-9997O TUNK LAKE 44.5986 -68.0592

So, there are four stations (rows 60, 69, 70 and 71 in the table above) currently included in the ICPWaters US project that should be removed. There are also four stations (rows 95 to 98) that are present in John's spreadsheet but not associated with the project. Of these, two (Middle Pond and Tunk) are already in the database but are associated only with the ICPW_TOCTRENDS_2015_US_LTM project, while one (Sucker) has been shifted to the ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED project. The remaining site (Barnes Lake) does not appear anywhere in the database, as far as I can tell.

Next steps:

  • Remove four sites (Little, Cow Mountain, Kettle and South Marlboro) from the ICPWaters US project (and transfer them to EXCLUDED)
  • Add Middle Pond, Sucker and Tunk Lake to the ICPWaters US project
  • Create a new site for Barnes Lake and link it to the ICPWaters US project
  • Move four sites (Brook Trout Lake, Moss Lake, Lake Rondaxe and Duck Pond) from ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED to ICPW_TOCTRENDS_2015_US_LTM

These changes are most easily made manually using Access, although for the record the project moves could equally be scripted. A minimal sketch of the last step is below (not run; the station_ids are placeholders, since the 'Site ID' column above holds NFC codes rather than RESA2 station_ids).
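
# Sketch only: move the four sites back from the EXCLUDED project to
# the main trends project. The ids are placeholders - the real
# station_ids need looking up in resa2.stations first
move_ids = (-1, -1, -1, -1)   # hypothetical station_ids for the four sites

sql = ('UPDATE resa2.projects_stations '
       'SET project_id = 3870 '      # ICPW_TOCTRENDS_2015_US_LTM
       'WHERE project_id = 4150 '    # ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED
       'AND station_id IN %s' % str(move_ids))

#result = conn.execute(sql)

As a quick check, let's make sure there are now 95 sites associated with ICPWaters US and 76 sites associated with ICPW_TOCTRENDS_2015_US_LTM.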


In [31]:
# Check numbers of sites
proj_codes = [1679,   # ICPWaters US
              3870]   # ICPW_TOCTRENDS_2015_US_LTM
              
for code in proj_codes:
    # Read stations list
    sql = ('SELECT * '
           'FROM resa2.projects_stations '
           'WHERE project_id = %s' % code)
    df = pd.read_sql_query(sql, engine)
    
    print code, len(df)


1679 95
3870 76

1.3. The Mystery of Mud Pond

I wish this section was as exciting as the title makes it sound!

Section 2 of the previous notebook identified two Mud Ponds. The one with RESA2 station_id=37063 is correct, but the other (station_id=23709) has incorrect co-ordinates and is several hundred kilometres from the true location. The incorrect site has become associated with lots of different projects and this means that deleting it is not straightforward. Let's get a list of projects using each of these sites.


In [10]:
# Find projects using the different versions of Mud Pond
for site_id in [23709, 37063]:
    sql = ('SELECT project_id, project_name '
           'FROM resa2.projects '
           'WHERE project_id IN (SELECT project_id '
                                'FROM resa2.projects_stations '
                                'WHERE station_id = %s)' % site_id)
    df = pd.read_sql_query(sql, engine)
    
    print 'Projects using station_id %s:\n' % site_id
    print df
    print '\n'


Projects using station_id 23709:

    project_id                   project_name
0         1559                     ICP Waters
1         1679                   ICPWaters US
2         1901                  ICPW-Nitrogen
3         1961                   TOCICP_MAINE
4         2240                    ICPW_TRENDS
5         3245      ICPW Regional Trend ESO4*
6         3246  ICPW Regional Trend ECA*+EMG*
7         3247    ICPW Regional Trend TOC/DOC
8         3248        ICPW Regional Trend ANC
9         3249        ICPW Regional Trend Alk
10        3250         ICPW Regional Trend H+
11        3251       ICPW Regional Trend ENO3
12        3765             ICPW_TOCTRENDS2006
13        3785          ICPW_TOCTRENDS2006_AM
14        3789                      TOCICP-US
15        3870     ICPW_TOCTRENDS_2015_US_LTM


Projects using station_id 37063:

   project_id                project_name
0        3870  ICPW_TOCTRENDS_2015_US_LTM


So, the incorrect site has been used in 16 different projects, most (all?) of which are linked to ICP Waters in some way. The only project using the correct site is ICPW_TOCTRENDS_2015_US_LTM. Before I can delete the incorrect site, I need to switch all these other projects to use station_id=37063 instead.

NB: Based on the above output, ICPW_TOCTRENDS_2015_US_LTM currently uses both Mud Ponds, so when I delete the incorrect one, there will only be 75 sites associated with this project. This means I'm still missing a site somewhere compared to John's spreadsheet. Come back to this.

In Access (equivalent SQL is sketched after this list):

  • Delete the link between station_id=23709 and ICPW_TOCTRENDS_2015_US_LTM in the projects_stations table.
  • For the other 15 projects, switch the station_id from 23709 to 37063.
  • Delete all references to site 23709 from the stations_par_values table.
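
For reference, these three steps expressed as SQL would be roughly as follows (a sketch only - the changes themselves were made in Access):

# 1. Unlink the incorrect Mud Pond from the trends project
sql = ('DELETE FROM resa2.projects_stations '
       'WHERE station_id = 23709 '
       'AND project_id = 3870')
#conn.execute(sql)

# 2. Re-point the remaining 15 projects at the correct site
sql = ('UPDATE resa2.projects_stations '
       'SET station_id = 37063 '
       'WHERE station_id = 23709')
#conn.execute(sql)

# 3. Remove references to the bad site from stations_par_values
sql = ('DELETE FROM resa2.stations_par_values '
       'WHERE station_id = 23709')
#conn.execute(sql)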

The tricky part is how to deal with the water chemistry results associated with each of these sites.


In [28]:
# Read sample info for incorrect site
sql = ('SELECT water_sample_id, station_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 23709')
df_23709 = pd.read_sql_query(sql, engine)

# Read sample info for correct site
sql = ('SELECT water_sample_id, station_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 37063')
df_37063 = pd.read_sql_query(sql, engine)

# Outer join
df = df_23709.merge(df_37063, how='outer',
                    on='sample_date',
                    suffixes=['_23709', '_37063'])

print 'Number of samples associated with site 23709:', len(df_23709)
print 'Number of samples associated with site 37063:', len(df_37063)
print 'Number of samples with common dates:         ', len(df.dropna(how='any'))


Number of samples associated with site 23709: 88
Number of samples associated with site 37063: 107
Number of samples with common dates:          63

It looks as though there is at least some overlap here, but it's not easy to see what's going on without some plots. Let's pick a commonly measured parameter ($Cl^-$ in $mg/l$) and plot the values for each site to see whether they're actually the same.


In [52]:
# Site 23709
sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 23709)')
df_23709 = pd.read_sql_query(sql, engine)

# Join dates
df_23709 = df_23709.merge(df, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id_23709')

# Tidy df
df_23709.sort_values(by='sample_date', inplace=True, ascending=True)
df_23709.index = df_23709['sample_date']
df_23709 = df_23709[['value']]
df_23709.columns = ['site_23709']

# Site 37063
sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 37063)')
df_37063 = pd.read_sql_query(sql, engine)

# Join dates
df_37063 = df_37063.merge(df, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id_37063')

# Tidy df
df_37063.sort_values(by='sample_date', inplace=True, ascending=True)
df_37063.index = df_37063['sample_date']
df_37063 = df_37063[['value']]
df_37063.columns = ['site_37063']

# Merge
merged_df = df_23709.merge(df_37063, how='outer',
                           left_index=True,
                           right_index=True)

print 'Total number of samples in merged series:', len(merged_df)

# Plot
merged_df.plot(figsize=(12, 8))
plt.title('Chloride concentration in mg/l at two Mud Ponds', fontsize=20)
plt.tight_layout()
mpld3.display()


Total number of samples in merged series: 132
Out[52]: [interactive mpld3 plot: 'Chloride concentration in mg/l at two Mud Ponds']

It's pretty clear that this is the same data series (although something weird happens in 1996 - use the pan and zoom tools to see this). However, there appears to be no consistent pattern regarding which samples have been assigned to which site. The 1996 discrepancy seems pretty minor, so the best way to fix this is probably as follows:

  • For dates where samples are recorded at both sites, choose data from site 37063 in preference to those from site 23709.
  • On dates where there are samples only for site 23709, transfer these to site 37063.

This should create a single series for site 37063 with 132 water samples. Any remaining samples associated with site 23709 can then be deleted (as they're already duplicated for site 37063) and then, with a bit of luck, I should also be able to remove the "bad" Mud Pond from the database without inflicting too much damage elsewhere.
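
The duplicates referred to in the first bullet can be pulled straight from the earlier outer join (a quick sketch; these are the 63 common-date samples counted above):

# Samples recorded at BOTH sites on the same date. The site 23709
# copies of these are the ones to delete later
dup_df = df.dropna(how='any')
dup_ids = dup_df['water_sample_id_23709'].map(int)
print 'Number of duplicated samples at site 23709:', len(dup_ids)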


In [60]:
# Identify samples associated ONLY with site 23709
trans_df = df[pd.isnull(df['station_id_37063'])]

# These need transferring to site 37063
sql = ('UPDATE resa2.water_samples '
       'SET station_id = 37063 '
       'WHERE water_sample_id IN %s' 
       % str(tuple(trans_df['water_sample_id_23709'].map(int))))

result = conn.execute(sql)

Now plot the results for chloride again to make sure it's worked.


In [3]:
# Site 23709
# Get dates
sql = ('SELECT water_sample_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 23709')
dates_23709 = pd.read_sql_query(sql, engine)

sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 23709)')
df_23709 = pd.read_sql_query(sql, engine)

# Join dates
df_23709 = df_23709.merge(dates_23709, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id')

# Tidy df
df_23709.sort_values(by='sample_date', inplace=True, ascending=True)
df_23709.index = df_23709['sample_date']
df_23709 = df_23709[['value']]
df_23709.columns = ['site_23709']

# Site 37063
# Get dates
sql = ('SELECT water_sample_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 37063')
dates_37063 = pd.read_sql_query(sql, engine)

sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 37063)')
df_37063 = pd.read_sql_query(sql, engine)

# Join dates
df_37063 = df_37063.merge(dates_37063, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id')

# Tidy df
df_37063.sort_values(by='sample_date', inplace=True, ascending=True)
df_37063.index = df_37063['sample_date']
df_37063 = df_37063[['value']]
df_37063.columns = ['site_37063']

# Merge
merged_df = df_23709.merge(df_37063, how='outer',
                           left_index=True,
                           right_index=True)

# Plot
merged_df.plot(subplots=True, sharey=True, figsize=(12, 8))
plt.tight_layout()
mpld3.display()


Out[3]: [two stacked time series panels: site_23709 and site_37063]

This looks OK, so I can now delete all the remaining records associated with station_id=23709...

...Actually, I can't delete these samples. They're protected by the database constraint SS_WSID_FK, which links water_samples to sample_selections. I'm not sure what the sample_selections table is, and I'm wary of deleting records from tables I don't understand in case it breaks some other component of the wider LabWare-RESA2-NivaBase system.
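
For the record, the delete that fails is essentially this:

# This raises a foreign key violation against SS_WSID_FK, so it's
# left commented out
sql = ('DELETE FROM resa2.water_samples '
       'WHERE station_id = 23709')
#conn.execute(sql)

When I come back to this, Oracle's data dictionary would be a reasonable place to start investigating (a sketch, assuming the standard ALL_CONSTRAINTS view is visible to this user):

# Which table owns SS_WSID_FK, and which constraint does it reference?
sql = ("SELECT owner, table_name, constraint_type, r_constraint_name "
       "FROM all_constraints "
       "WHERE constraint_name = 'SS_WSID_FK'")
pd.read_sql_query(sql, engine)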

For now, I'll leave site 23709 in the database, along with the remaining samples associated with it. All I've actually done here is transfer 25 water samples from the early record of station 23709 to station 37063. It won't be difficult to undo these changes if necessary, but it would be good to tidy this up properly. Come back to this and find out what the sample_selections table actually does.

1.4. The Mystery of Multiple Sites

It seems the Mystery of Mud Pond runs a bit deeper than originally thought. Many of the ICPW sites in RESA2 are duplicated, with the following as a typical example:

station_id station_code station_name lat long
23661 US26 Biscuit Brook, Catskills 41.9869 -74.5031
37014 X15:1434025 BISCUIT BROOK 42.0111 -74.4147

The first entry (station_id 23661) is associated with the "core" ICP Waters programme, whereas the second (station_id 37014) was added later as part of the trend analysis project. Despite the slight differences in co-ordinates, these are actually the same site.

Adding stations twice is a little dangerous in a relational database, as it breaks the principle of database normalisation. Nevertheless, as Tore points out in his e-mail of 01/09/2016 at 08:51, it is sometimes justified as the most convenient thing to do. There are some potential problems with this, though. For example, when John and Don were here in May, they extracted site data for both the ICPWaters US and ICPW_TOCTRENDS_2015_US_LTM projects and were understandably confused by the different site codes and co-ordinates. More importantly, when sites are duplicated it's very easy to forget to update both versions when new data come in. The result, as illustrated by Mud Pond above, is that we end up with two patchy data records, neither of which tells the whole story by itself.

This wouldn't be a problem if the ICPWaters XX projects (and the associated sites, e.g. US26 in the table above) were "archived" to represent the state of the data at the time of some previous analysis; any more recent data would then be added to a new project. However, as far as I can tell, this isn't the case: each year, new data from the Focal Centres are added to the ICPWaters XX sites, while ICPW_TOCTRENDS_2015_XX is more of a "static" dataset, based on a bulk data upload at a single point in time.

For the US sites, John has produced a spreadsheet that attempts to sort all this out, by linking the RESA2 codes used by ICPWaters to USEPA codes and corrected site properties. The main question now is how to use this information.

Given that the ICPWaters XX projects do not represent an accurate archive of the data at some previous point in time (because more data have been added subsequently), and considering that many of the latitudes and longitudes for these stations are now known to be incorrect, there doesn't seem to be much justification for keeping the duplicate sites. However, because the water samples associated with the individual sites in each pair are different (see the Mud Pond example above), merging the duplicated datasets is not necessarily straightforward. A simpler option would be to keep the sites separate but correct/update the site properties (names, co-ordinates etc.) based on John's spreadsheet, so that at least the duplicated sites are consistently duplicated. This would require less effort, but will likely lead to further confusion in the future, and it also means the updated trends analysis will not make use of the full dataset.
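
For reference, the simpler option would amount to something like the sketch below (not run - rows where 'Station Code' is NaN would need skipping, and the naive string formatting would break on names containing quotes):

# Sketch of the 'consistently duplicated' option: overwrite station
# properties in RESA2 with the corrected values from John's
# spreadsheet. NOT run
for idx in range(len(j_df)):
    sql = ("UPDATE resa2.stations "
           "SET station_name = '%s', latitude = %s, longitude = %s "
           "WHERE station_code = '%s'"
           % (j_df.ix[idx, 'Suggested Name'],
              j_df.ix[idx, 'Latitude'],
              j_df.ix[idx, 'Longitude'],
              j_df.ix[idx, 'Station Code']))
    #conn.execute(sql)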

1.4.1. Which US sites are in the database?

Rather than worrying about which projects have which sites, a useful preliminary step will be to simply check which of the sites in John's spreadsheet are actually in the database, and whether the properties are correct. Even this isn't entirely straightforward, but I've had a go using the code below.


In [4]:
# Read stations table
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations')

stn_df = pd.read_sql_query(sql, engine)
stn_df.head()


Out[4]:
station_id station_code station_name latitude longitude
0 110 1845-601 Tennvatn 67.593796 15.487890
1 111 604-608 Øvre Jerpetjern 59.608000 9.427000
2 112 1101-43 Glypstadvatnet 58.487223 6.188384
3 113 2030-701 Serdivatnet 69.606075 30.374109
4 115 831-501 Brårvatn 59.294921 7.727118

In [14]:
# Get a subset of columns from John's spreadsheet
us_df = j_df[['Station Code', 'Station name', 'Suggested Name',
              'Active', 'INCLUDE IN ICP WATERS DATABASE', 
              'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',
              'NFC_SITENAME', 'Latitude', 'Longitude']]

# Rename columns
us_df.columns = ['station_code', 'station_name', 'suggested_name',
                 'active', 'core', 'trend', 'nfc_code', 'nfc_name', 
                 'latitude', 'longitude']

us_df.head()


Out[14]:
station_code station_name suggested_name active core trend nfc_code nfc_name latitude longitude
0 US24 New York, Catskill Mnt., Rondout Creek Rondout Creek, Catskills YES YES YES 01364959 RONDOUT CREEK 41.9367 -74.3764
1 US25 W Br Neversink R At Winnisook, Catskills West Branch Neversink River at Winnisook, Cats... YES YES YES 01434021 W BR NEVERSINK R AT WINNISOOK 41.9725 -74.4483
2 US26 Biscuit Brook, Catskills Biscuit Brook, Catskills YES YES YES 01434025 BISCUIT BROOK 42.0111 -74.4147
3 US23 New York, Catskill Mnt., E. Branch Neversink, ... East Branch Neversink River, Catskills YES YES YES 0143400680 EAST BRANCH NEVERSINK, HEADWAT 41.9869 -74.5031
4 US27 Little Hope Pond, Adirondacks Little Hope Pond, Adirondacks YES YES NO, road salt contamination 020058O LITTLE HOPE POND 44.5158 -74.1252

In [37]:
# Dict to store results
res_dict = {'stn_code':[],
            'code_present':[],
            'name_ok':[],
            'lat_ok':[],
            'lon_ok':[]}

# Iterate over spreadsheet
for idx in range(len(us_df)):
    # Get stn id
    stn_cd = us_df.ix[idx, 'station_code']
    name = us_df.ix[idx, 'suggested_name'] # These are John's ideal names
    lat = us_df.ix[idx, 'latitude']
    lon = us_df.ix[idx, 'longitude']
    
    # Check stations table
    q_res = stn_df.query('station_code == @stn_cd')
    
    if len(q_res) == 0:
        # The site isn't present with this code
        res_dict['stn_code'].append(stn_cd)
        res_dict['code_present'].append(0)
        res_dict['name_ok'].append(np.nan)
        res_dict['lat_ok'].append(np.nan)
        res_dict['lon_ok'].append(np.nan)
    
    elif len(q_res) == 1:
        # Check the site properties
        res_dict['stn_code'].append(stn_cd)
        res_dict['code_present'].append(1)
        res_dict['name_ok'].append(name == q_res.station_name.values[0])
        res_dict['lat_ok'].append(lat == q_res.latitude.values[0])
        res_dict['lon_ok'].append(lon == q_res.longitude.values[0])
        
    else:
        # This should be impossible
        raise ValueError('Site code %s appears to be duplicated.' % stn_cd)

# Build df 
res_df = pd.DataFrame(res_dict)
res_df = res_df[['stn_code', 'code_present', 
                 'name_ok', 'lat_ok', 'lon_ok']]

res_df.head()


C:\Data\64_Bit_WinPython\python-2.7.10.amd64\lib\site-packages\ipykernel\__main__.py:31: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Out[37]:
stn_code code_present name_ok lat_ok lon_ok
0 US24 1 False False False
1 US25 1 False False False
2 US26 1 True False False
3 US23 1 False False False
4 US27 1 True True True

Note the Unicode warning here, highlighting that some comparisons on the name column have failed because of special text characters, so the name results are probably a worst-case scenario: some sites flagged as mismatches may actually be fine.
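
One way to make the name comparison more robust would be to normalise both strings first. A sketch (not applied above):

import unicodedata

def norm_name(s):
    """Lower-case, accent-stripped version of s for comparison."""
    if not isinstance(s, unicode):
        s = unicode(str(s), 'utf-8', 'replace')
    s = unicodedata.normalize('NFKD', s)
    return s.encode('ascii', 'ignore').strip().lower()

# e.g. the name check above would become:
# res_dict['name_ok'].append(norm_name(name) ==
#                            norm_name(q_res.station_name.values[0]))

Nevertheless, some of the sites match exactly even without this. Let's see how many matches and partial matches we have.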


In [38]:
# How many matches?
print 'Total number of records in spreadsheet:               ', len(res_df)
print 'Of these...'
print 'Number matched by code:                               ', len(res_df.query('code_present==1'))
print 'Number matched by code with correct lat and lon:      ', len(res_df.query('lat_ok==1 & lon_ok==1'))
print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))


Total number of records in spreadsheet:                110
Of these...
Number matched by code:                                95
Number matched by code with correct lat and lon:       67
Number matched by code with correct name, lat and lon: 58

The script above uses RESA2 codes taken from John's spreadsheet. For the most part these are the old codes (e.g. US26) associated with the original ICPWaters US project (rather than the more recent codes, such as X15:1434025, used in the trends project). It seems that 58 of these sites have exactly the correct information (name, lat and long), and a further 9 have the correct lat and long but do not use John's suggested names. A further 28 sites can be matched by code but have incorrect geographic co-ordinates.

In total, there are 95 matches based on site codes. A further 15 sites (the ones marked NA in John's spreadsheet) cannot be matched at all but, of these, only 4 should be included in the core ICPW dataset and only 2 in the trends analysis. These four sites have already been dealt with above (see the start of section 1.2): they are in the database with suitable codes - it's just that these codes are missing from John's workbook.

What about the sites created as duplicates for the trend analysis project? How many of those can be matched from John's spreadsheet?


In [69]:
# Dict to store results
res_dict = {'nfc_code':[],
            'code_present':[],
            'name_ok':[],
            'lat_ok':[],
            'lon_ok':[]}

# Iterate over spreadsheet
for idx in range(len(us_df)):
    # Get stn id
    nfc_cd = us_df.ix[idx, 'nfc_code']
    name = us_df.ix[idx, 'nfc_name'] # These are John's ideal names
    lat = us_df.ix[idx, 'latitude']
    lon = us_df.ix[idx, 'longitude']
    
    # Check stations table. Need to add 'X15:' and allow for variants
    q_res = stn_df[(stn_df['station_code']=='X15:%s' % nfc_cd) |
                   (stn_df['station_code']=='X15:%s' % nfc_cd[1:]) |
                   (stn_df['station_code']=='X15:%s' % nfc_cd[:-1])]
    
    if len(q_res) == 0:
        # The site isn't present with this code
        res_dict['nfc_code'].append(nfc_cd)
        res_dict['code_present'].append(0)
        res_dict['name_ok'].append(np.nan)
        res_dict['lat_ok'].append(np.nan)
        res_dict['lon_ok'].append(np.nan)
    
    elif len(q_res) == 1:
        # Check the site properties
        res_dict['nfc_code'].append(nfc_cd)
        res_dict['code_present'].append(1)
        res_dict['name_ok'].append(name == q_res.station_name.values[0])
        res_dict['lat_ok'].append(lat == q_res.latitude.values[0])
        res_dict['lon_ok'].append(lon == q_res.longitude.values[0])
        
    else:
        # This should be impossible
        raise ValueError('Site code %s appears to be duplicated.' % nfc_cd)

# Build df 
res_df = pd.DataFrame(res_dict)
res_df = res_df[['nfc_code', 'code_present', 
                 'name_ok', 'lat_ok', 'lon_ok']]

res_df.head()


Out[69]:
nfc_code code_present name_ok lat_ok lon_ok
0 01364959 1 True True True
1 01434021 1 False True True
2 01434025 1 True True True
3 0143400680 1 False True True
4 020058O 1 True True True

In [70]:
# How many matches?
print 'Total number of records in spreadsheet:               ', len(res_df)
print 'Of these...'
print 'Number matched by code:                               ', len(res_df.query('code_present==1'))
print 'Number matched by code with correct lat and lon:      ', len(res_df.query('lat_ok==1 & lon_ok==1'))
print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))


Total number of records in spreadsheet:                110
Of these...
Number matched by code:                                89
Number matched by code with correct lat and lon:       74
Number matched by code with correct name, lat and lon: 68

This approach manages to match most (but not necessarily all) of the US sites using the X15: convention, but note that I had to jump through a few hoops to get here: the code above initially tried to find matches using just the raw NFC code with X15: prepended, but this failed in many cases. Two causes (there may be others) are:

  • Leading zeros on the NFC site codes have been truncated when adding to RESA2. For example, station 01364959 is in the database as X15:1364959.

  • Trailing letter codes have been removed (or weren't included originally) upon adding to RESA2. For example, station 1C1-112E is in the database as X15:1C1-112.

Unfortunately, these kinds of discrepancies make automated cleaning of the database difficult.
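
In the meantime, the variants encountered so far can at least be collected into a small helper, so that any future matching is done consistently (a sketch - more rules will probably be needed as new cases turn up):

def x15_candidates(nfc_cd):
    """Candidate RESA2 station codes for a given NFC site code."""
    return set(['X15:%s' % nfc_cd,        # raw code
                'X15:%s' % nfc_cd[1:],    # leading zero truncated
                'X15:%s' % nfc_cd[:-1]])  # trailing letter dropped

print x15_candidates('0143400680')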

A final test for the moment is to see whether I've actually managed to find all the stations that John would like included in the trends analysis.


In [78]:
# Filter dfs
match_df = res_df.query('code_present==1')
trend_df = us_df.query('trend == "YES"')

# Join
trend_df = trend_df.merge(match_df, how='left',
                          on='nfc_code')


# Identify records with no match
trend_df[pd.isnull(trend_df['code_present'])]


Out[78]:
station_code station_name suggested_name active core trend nfc_code nfc_name latitude longitude code_present name_ok lat_ok lon_ok
65 US74 Mud Pond, Maine Mud Pond, Maine YES YES YES 1E1-134E MUD POND 44.6330 -68.0908 NaN NaN NaN NaN
74 US82 Newbert Pond, Maine Newbert Pond, Maine YES YES YES ME-9998E NEWBERT POND 44.3364 -69.2711 NaN NaN NaN NaN

So I've managed to match all the sites for the trends work, with two exceptions. The first is (the increasingly troublesome) Mud Pond, which hasn't matched because, when I tidied it up above, I changed the site code for station 37063 back to the original US74 rather than the X15: version. The second is Newbert Pond, which is in the database as part of the core ICP Waters programme, but apparently hasn't previously been part of the trends analysis.

Summary

This exploration proves that, with a bit of work, I can identify the sites John would like included in each project and clean up their properties. I haven't really begun to tackle the problem of actually integrating the data, though - that's going to be much more challenging.

It's also worth emphasising that the above was only really possible thanks to John's spreadsheet, so doing the same thing for other countries may be more difficult.