In [29]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, imp, glob, os, numpy as np
from sqlalchemy import create_engine
sn.set_context('notebook')

TOC trends 2015: database clean-up (part 3)

Back from holiday and trying to remember where I'd got to with the ICPW database. John has sent detailed replies to my questions regarding the US sites (see e-mails received 25/08/2016 at 01:45 and 03:33), so this seems like a good place to start.

1. Checking US sites

My initial clean-up of RESA2 ended with 72 sites linked to the ICPW_TOCTRENDS_2015_US_LTM project (see section 2 of this notebook for details) and, following advice from Tore, I also shifted 19 US sites into a new project called ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED. In addition, manually checking RESA2 identifies 95 sites associated with the ICPWaters US project.

Based on the information in John's emails, some of this needs revisiting.

1.1. Connect to db and check site numbers are as expected


In [2]:
# Import custom functions and connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')

resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

engine, conn = resa2_basic.connect_to_resa2()
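
(connect_to_resa2 itself isn't shown in this notebook. Presumably it wraps something like the sketch below; the dialect, host and credentials here are invented for illustration only.)

# Hypothetical sketch of what connect_to_resa2() might do - the
# connection string below is made up; only the general pattern matters
def connect_to_resa2_sketch(user='resa2', pwd='********'):
    conn_str = 'oracle+cx_oracle://%s:%s@nivabase' % (user, pwd)
    engine = create_engine(conn_str)
    conn = engine.connect()
    return engine, conn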

In [4]:
# Check numbers of sites
proj_codes = [1679,   # ICPWaters US
              3870,   # ICPW_TOCTRENDS_2015_US_LTM
              4150]   # ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED

for code in proj_codes:
    # Read stations list
    sql = ('SELECT * '
           'FROM resa2.projects_stations '
           'WHERE project_id = %s' % code)
    df = pd.read_sql_query(sql, engine)
    
    print code, len(df)


1679 95
3870 72
4150 19

1.2. Check sites against John's spreadsheet

Based on John's e-mail, four of the sites that I previously excluded should actually be included in the trends project, so I need to move these back across:

Site ID Site name
040874O Brook Trout Lake, Adirondacks
1A1-109O Moss Lake, Adirondacks
1A1-110O Lake Rondaxe, Adirondacks
ME-4474 Duck Pond, Maine

John has sent a spreadsheet (U.S.Site.Reconciliation.August.2016.xlsx) listing which sites should be included in which projects. First, compare the number of sites in John's spreadsheet with the numbers in RESA2.


In [3]:
# Read John's spreadsheet
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Call_for_Data_2016'
          r'\Replies\usa\U.S.Site.Reconciliation.August.2016.xlsx')

j_df = pd.read_excel(in_xls, sheetname='Sheet1')

print 'John, core ICPW sites: ', len(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'])
print 'John, DOC trends sites:', len(j_df[j_df['INCLUDE IN ICP DOC ANALYSIS']=='YES'])


John, core ICPW sites:  95
John, DOC trends sites: 76

This looks promising. With a bit of luck, I can just move the four sites listed above from the EXCLUDED project back into ICPW_TOCTRENDS_2015_US_LTM and everything will be OK...

...actually, no - it's not that simple. Of the 95 ICPWaters US sites in the spreadsheet, John only has RESA2 codes for 91 of them (four are marked NA). This implies the 95 sites currently in RESA2 are not the same as the 95 in the spreadsheet, so I need to work out where the discrepancies are.


In [28]:
# Get the 95 'ICPWaters US' sites from RESA2
sql = ('SELECT * FROM resa2.stations '
       'WHERE station_id IN (SELECT station_id '
                            'FROM resa2.projects_stations '
                            'WHERE project_id = 1679)')
core_df = pd.read_sql_query(sql, engine)

# Outer join based on RESA2 station code
df = core_df.merge(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'], 
                   how='outer', 
                   left_on='station_code',
                   right_on='Station Code')

# Cols of interest
cols = ['station_id', 'station_code', 'station_name', 'latitude', 'longitude',
        'Station Code', 'NFC_SITEID', 'NFC_SITENAME', 'Latitude', 'Longitude']
df = df[cols]

# Get entries where no match was found
df[pd.isnull(df['station_code']) | pd.isnull(df['Station Code'])]


Out[28]:
station_id station_code station_name latitude longitude Station Code NFC_SITEID NFC_SITENAME Latitude Longitude
60 23647.0 US126 Little, Winhall, Vermont NaN NaN NaN NaN NaN NaN NaN
69 23643.0 US122 Cow Mountain, Vermont NaN NaN NaN NaN NaN NaN NaN
70 23645.0 US124 Kettle, Vermont NaN NaN NaN NaN NaN NaN NaN
71 23620.0 1C3-075E South Marlboro, Vermont 42.8439 -72.7125 NaN NaN NaN NaN NaN
95 NaN NaN NaN NaN NaN NaN 1A1-029O MIDDLE POND 44.3369 -74.3719
96 NaN NaN NaN NaN NaN NaN 1A2-076O BARNES LAKE 43.5667 -75.2278
97 NaN NaN NaN NaN NaN NaN 1C1-106E SUCKER 42.8250 -73.1292
98 NaN NaN NaN NaN NaN NaN ME-9997O TUNK LAKE 44.5986 -68.0592

So, there are four stations (rows 60, 69, 70 and 71 in the table above) currently included in the ICPWaters US project that should be removed. There are also four stations (rows 95 to 98) that are present in John's spreadsheet but not associated with the project. Of these, two (Middle Pond and Tunk) are already in the database but are associated only with the ICPW_TOCTRENDS_2015_US_LTM project, while one (Sucker) has been shifted to the ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED project. The remaining site (Barnes Lake) does not appear anywhere in the database, as far as I can tell.

Next steps:

  • Remove four sites (Little, Cow Mountain, Kettle and South Marlboro) from the ICPWaters US project (and transfer them to EXCLUDED)
  • Add Middle Pond, Sucker and Tunk Lake to the ICPWaters US project
  • Create a new site for Barnes Lake and link it to the ICPWaters US project
  • Move four sites (Brook Trout Lake, Moss Lake, Lake Rondaxe and Duck Pond) from ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED to ICPW_TOCTRENDS_2015_US_LTM

These changes are most easily made manually using Access, although for the record the project moves could equally be scripted. A minimal sketch of the last step is below (not run; the station_ids are placeholders, since the 'Site ID' column above holds NFC codes rather than RESA2 station_ids).
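
# Sketch only: move the four sites back from the EXCLUDED project to
# the main trends project. The ids are placeholders - the real
# station_ids need looking up in resa2.stations first
move_ids = (-1, -1, -1, -1)   # hypothetical station_ids for the four sites

sql = ('UPDATE resa2.projects_stations '
       'SET project_id = 3870 '      # ICPW_TOCTRENDS_2015_US_LTM
       'WHERE project_id = 4150 '    # ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED
       'AND station_id IN %s' % str(move_ids))

#result = conn.execute(sql)

As a quick check, let's make sure there are now 95 sites associated with ICPWaters US and 76 sites associated with ICPW_TOCTRENDS_2015_US_LTM.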


In [31]:
# Check numbers of sites
proj_codes = [1679,   # ICPWaters US
              3870]   # ICPW_TOCTRENDS_2015_US_LTM
              
for code in proj_codes:
    # Read stations list
    sql = ('SELECT * '
           'FROM resa2.projects_stations '
           'WHERE project_id = %s' % code)
    df = pd.read_sql_query(sql, engine)
    
    print code, len(df)


1679 95
3870 76

1.3. The Mystery of Mud Pond

I wish this section was as exciting as the title makes it sound!

Section 2 of the previous notebook identified two Mud Ponds. The one with RESA2 station_id=37063 is correct, but the other (station_id=23709) has incorrect co-ordinates and is several hundred kilometres from the true location. The incorrect site has become associated with lots of different projects and this means that deleting it is not straightforward. Let's get a list of projects using each of these sites.


In [10]:
# Find projects using the different versions of Mud Pond
for site_id in [23709, 37063]:
    sql = ('SELECT project_id, project_name '
           'FROM resa2.projects '
           'WHERE project_id IN (SELECT project_id '
                                'FROM resa2.projects_stations '
                                'WHERE station_id = %s)' % site_id)
    df = pd.read_sql_query(sql, engine)
    
    print 'Projects using station_id %s:\n' % site_id
    print df
    print '\n'


Projects using station_id 23709:

    project_id                   project_name
0         1559                     ICP Waters
1         1679                   ICPWaters US
2         1901                  ICPW-Nitrogen
3         1961                   TOCICP_MAINE
4         2240                    ICPW_TRENDS
5         3245      ICPW Regional Trend ESO4*
6         3246  ICPW Regional Trend ECA*+EMG*
7         3247    ICPW Regional Trend TOC/DOC
8         3248        ICPW Regional Trend ANC
9         3249        ICPW Regional Trend Alk
10        3250         ICPW Regional Trend H+
11        3251       ICPW Regional Trend ENO3
12        3765             ICPW_TOCTRENDS2006
13        3785          ICPW_TOCTRENDS2006_AM
14        3789                      TOCICP-US
15        3870     ICPW_TOCTRENDS_2015_US_LTM


Projects using station_id 37063:

   project_id                project_name
0        3870  ICPW_TOCTRENDS_2015_US_LTM


So, the incorrect site has been used in 16 different projects, most (all?) of which are linked to ICP Waters in some way. The only project using the correct site is ICPW_TOCTRENDS_2015_US_LTM. Before I can delete the incorrect site, I need to switch all these other projects to use station_id=37063 instead.

NB: Based on the above output, ICPW_TOCTRENDS_2015_US_LTM currently uses both Mud Ponds, so when I delete the incorrect one, there will only be 75 sites associated with this project. This means I'm still missing a site somewhere compared to John's spreadsheet. Come back to this.

In Access (equivalent SQL is sketched after this list):

  • Delete the link between station_id=23709 and ICPW_TOCTRENDS_2015_US_LTM in the projects_stations table.
  • For the other 15 projects, switch the station_id from 23709 to 37063.
  • Delete all references to site 23709 from the stations_par_values table.
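
For reference, these three steps expressed as SQL would be roughly as follows (a sketch only - the changes themselves were made in Access):

# 1. Unlink the incorrect Mud Pond from the trends project
sql = ('DELETE FROM resa2.projects_stations '
       'WHERE station_id = 23709 '
       'AND project_id = 3870')
#conn.execute(sql)

# 2. Re-point the remaining 15 projects at the correct site
sql = ('UPDATE resa2.projects_stations '
       'SET station_id = 37063 '
       'WHERE station_id = 23709')
#conn.execute(sql)

# 3. Remove references to the bad site from stations_par_values
sql = ('DELETE FROM resa2.stations_par_values '
       'WHERE station_id = 23709')
#conn.execute(sql)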

The tricky part is how to deal with the water chemistry results associated with each of these sites.


In [28]:
# Read sample info for incorrect site
sql = ('SELECT water_sample_id, station_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 23709')
df_23709 = pd.read_sql_query(sql, engine)

# Read sample info for correct site
sql = ('SELECT water_sample_id, station_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 37063')
df_37063 = pd.read_sql_query(sql, engine)

# Outer join
df = df_23709.merge(df_37063, how='outer',
                    on='sample_date',
                    suffixes=['_23709', '_37063'])

print 'Number of samples associated with site 23709:', len(df_23709)
print 'Number of samples associated with site 37063:', len(df_37063)
print 'Number of samples with common dates:         ', len(df.dropna(how='any'))


Number of samples associated with site 23709: 88
Number of samples associated with site 37063: 107
Number of samples with common dates:          63

It looks as though there is at least some overlap here, but it's not easy to see what's going on without some plots. Let's pick a commonly measured parameter ($Cl^-$ in $mg/l$) and plot the values for each site to see whether they're actually the same.


In [52]:
# Site 23709
sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 23709)')
df_23709 = pd.read_sql_query(sql, engine)

# Join dates
df_23709 = df_23709.merge(df, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id_23709')

# Tidy df
df_23709.sort_values(by='sample_date', inplace=True, ascending=True)
df_23709.index = df_23709['sample_date']
df_23709 = df_23709[['value']]
df_23709.columns = ['site_23709']

# Site 37063
sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 37063)')
df_37063 = pd.read_sql_query(sql, engine)

# Join dates
df_37063 = df_37063.merge(df, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id_37063')

# Tidy df
df_37063.sort_values(by='sample_date', inplace=True, ascending=True)
df_37063.index = df_37063['sample_date']
df_37063 = df_37063[['value']]
df_37063.columns = ['site_37063']

# Merge
merged_df = df_23709.merge(df_37063, how='outer',
                           left_index=True,
                           right_index=True)

print 'Total number of samples in merged series:', len(merged_df)

# Plot
merged_df.plot(figsize=(12, 8))
plt.title('Chloride concentration in mg/l at two Mud Ponds', fontsize=20)
plt.tight_layout()
mpld3.display()


Total number of samples in merged series: 132
Out[52]: [interactive mpld3 plot: 'Chloride concentration in mg/l at two Mud Ponds']

It's pretty clear that this is the same data series (although something weird happens in 1996 - use the pan and zoom tools to see this). However, there appears to be no consistent pattern regarding which samples have been assigned to which site. The 1996 discrepancy seems pretty minor, so the best way to fix this is probably as follows:

  • For dates where samples are recorded at both sites, choose data from site 37063 in preference to those from site 23709.
  • On dates where there are samples only for site 23709, transfer these to site 37063.

This should create a single series for site 37063 with 132 water samples. Any remaining samples associated with site 23709 can then be deleted (as they're already duplicated for site 37063) and then, with a bit of luck, I should also be able to remove the "bad" Mud Pond from the database without inflicting too much damage elsewhere.
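
The duplicates referred to in the first bullet can be pulled straight from the earlier outer join (a quick sketch; these are the 63 common-date samples counted above):

# Samples recorded at BOTH sites on the same date. The site 23709
# copies of these are the ones to delete later
dup_df = df.dropna(how='any')
dup_ids = dup_df['water_sample_id_23709'].map(int)
print 'Number of duplicated samples at site 23709:', len(dup_ids)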


In [60]:
# Identify samples associated ONLY with site 23709
trans_df = df[pd.isnull(df['station_id_37063'])]

# These need transferring to site 37063
sql = ('UPDATE resa2.water_samples '
       'SET station_id = 37063 '
       'WHERE water_sample_id IN %s' 
       % str(tuple(trans_df['water_sample_id_23709'].map(int))))

result = conn.execute(sql)

Now plot the results for chloride again to make sure it's worked.


In [3]:
# Site 23709
# Get dates
sql = ('SELECT water_sample_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 23709')
dates_23709 = pd.read_sql_query(sql, engine)

sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 23709)')
df_23709 = pd.read_sql_query(sql, engine)

# Join dates
df_23709 = df_23709.merge(dates_23709, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id')

# Tidy df
df_23709.sort_values(by='sample_date', inplace=True, ascending=True)
df_23709.index = df_23709['sample_date']
df_23709 = df_23709[['value']]
df_23709.columns = ['site_23709']

# Site 37063
# Get dates
sql = ('SELECT water_sample_id, sample_date '
       'FROM resa2.water_samples '
       'WHERE station_id = 37063')
dates_37063 = pd.read_sql_query(sql, engine)

sql = ('SELECT sample_id, value '
       'FROM resa2.water_chemistry_values2 '
       'WHERE method_id = 10253 '
       'AND sample_id IN (SELECT water_sample_id '
                         'FROM resa2.water_samples '
                         'WHERE station_id = 37063)')
df_37063 = pd.read_sql_query(sql, engine)

# Join dates
df_37063 = df_37063.merge(dates_37063, how='left',
                          left_on='sample_id',
                          right_on='water_sample_id')

# Tidy df
df_37063.sort_values(by='sample_date', inplace=True, ascending=True)
df_37063.index = df_37063['sample_date']
df_37063 = df_37063[['value']]
df_37063.columns = ['site_37063']

# Merge
merged_df = df_23709.merge(df_37063, how='outer',
                           left_index=True,
                           right_index=True)

# Plot
merged_df.plot(subplots=True, sharey=True, figsize=(12, 8))
plt.tight_layout()
mpld3.display()


Out[3]: [two stacked time series panels: site_23709 and site_37063]

This looks OK, so I can now delete all the remaining records associated with station_id=23709...

...Actually, I can't delete these samples. They're protected by the database constraint SS_WSID_FK, which links water_samples to sample_selections. I'm not sure what the sample_selections table is, and I'm wary of deleting records from tables I don't understand in case it breaks some other component of the wider LabWare-RESA2-NivaBase system.
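
For the record, the delete that fails is essentially this:

# This raises a foreign key violation against SS_WSID_FK, so it's
# left commented out
sql = ('DELETE FROM resa2.water_samples '
       'WHERE station_id = 23709')
#conn.execute(sql)

When I come back to this, Oracle's data dictionary would be a reasonable place to start investigating (a sketch, assuming the standard ALL_CONSTRAINTS view is visible to this user):

# Which table owns SS_WSID_FK, and which constraint does it reference?
sql = ("SELECT owner, table_name, constraint_type, r_constraint_name "
       "FROM all_constraints "
       "WHERE constraint_name = 'SS_WSID_FK'")
pd.read_sql_query(sql, engine)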

For now, I'll leave site 23709 in the database, along with the remaining samples associated with it. All I've actually done here is transfer 25 water samples from the early record of station 23709 to station 37063. It won't be difficult to undo these changes if necessary, but it would be good to tidy this up properly. Come back to this and find out what the sample_selections table actually does.

1.4. The Mystery of Multiple Sites

It seems the Mystery of Mud Pond runs a bit deeper than originally thought. Many of the ICPW sites in RESA2 are duplicated, with the following as a typical example:

station_id station_code station_name lat long
23661 US26 Biscuit Brook, Catskills 41.9869 -74.5031
37014 X15:1434025 BISCUIT BROOK 42.0111 -74.4147

The first entry (station_id 23661) is associated with the "core" ICP Waters programme, whereas the second (station_id 37014) was added later as part of the trend analysis project. Despite the slight differences in co-ordinates, these are actually the same site.

Adding stations twice is a little dangerous in a relational database, as it breaks the principle of database normalisation. Nevertheless, as Tore points out in his e-mail of 01/09/2016 at 08:51, it is sometimes justified as the most convenient thing to do. There are some potential problems with this, though. For example, when John and Don were here in May, they extracted site data for both the ICPWaters US and ICPW_TOCTRENDS_2015_US_LTM projects and were understandably confused by the different site codes and co-ordinates. More importantly, when sites are duplicated it's very easy to forget to update both versions when new data come in. The result, as illustrated by Mud Pond above, is that we end up with two patchy data records, neither of which tells the whole story by itself.

This wouldn't be a problem if the ICPWaters XX projects (and the associated sites, e.g. US26 in the table above) were "archived" to represent the state of the data at the time of some previous analysis; any more recent data would then be added to a new project. However, as far as I can tell, this isn't the case: each year, new data from the Focal Centres are added to the ICPWaters XX sites, while ICPW_TOCTRENDS_2015_XX is more of a "static" dataset, based on a bulk data upload at a single point in time.

For the US sites, John has produced a spreadsheet that attempts to sort all this out, by linking the RESA2 codes used by ICPWaters to USEPA codes and corrected site properties. The main question now is how to use this information.

Given that the ICPWaters XX projects do not represent an accurate archive of the data at some previous point in time (because more data have been added subsequently), and considering that many of the latitudes and longitudes for these stations are now known to be incorrect, there doesn't seem to be much justification for keeping the duplicate sites. However, because the water samples associated with the individual sites in each pair are different (see the Mud Pond example above), merging the duplicated datasets is not necessarily straightforward. A simpler option would be to keep the sites separate but correct/update the site properties (names, co-ordinates etc.) based on John's spreadsheet, so that at least the duplicated sites are consistently duplicated. This would require less effort, but will likely lead to further confusion in the future, and it also means the updated trends analysis will not make use of the full dataset.
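
For reference, the simpler option would amount to something like the sketch below (not run - rows where 'Station Code' is NaN would need skipping, and the naive string formatting would break on names containing quotes):

# Sketch of the 'consistently duplicated' option: overwrite station
# properties in RESA2 with the corrected values from John's
# spreadsheet. NOT run
for idx in range(len(j_df)):
    sql = ("UPDATE resa2.stations "
           "SET station_name = '%s', latitude = %s, longitude = %s "
           "WHERE station_code = '%s'"
           % (j_df.ix[idx, 'Suggested Name'],
              j_df.ix[idx, 'Latitude'],
              j_df.ix[idx, 'Longitude'],
              j_df.ix[idx, 'Station Code']))
    #conn.execute(sql)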

1.4.1. Which US sites are in the database?

Rather than worrying about which projects have which sites, a useful preliminary step will be to simply check which of the sites in John's spreadsheet are actually in the database, and whether the properties are correct. Even this isn't entirely straightforward, but I've had a go using the code below.


In [4]:
# Read stations table
sql = ('SELECT station_id, station_code, station_name, latitude, longitude '
       'FROM resa2.stations')

stn_df = pd.read_sql_query(sql, engine)
stn_df.head()


Out[4]:
station_id station_code station_name latitude longitude
0 110 1845-601 Tennvatn 67.593796 15.487890
1 111 604-608 Øvre Jerpetjern 59.608000 9.427000
2 112 1101-43 Glypstadvatnet 58.487223 6.188384
3 113 2030-701 Serdivatnet 69.606075 30.374109
4 115 831-501 Brårvatn 59.294921 7.727118

In [14]:
# Get a subset of columns from John's spreadsheet
us_df = j_df[['Station Code', 'Station name', 'Suggested Name',
              'Active', 'INCLUDE IN ICP WATERS DATABASE', 
              'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',
              'NFC_SITENAME', 'Latitude', 'Longitude']]

# Rename columns
us_df.columns = ['station_code', 'station_name', 'suggested_name',
                 'active', 'core', 'trend', 'nfc_code', 'nfc_name', 
                 'latitude', 'longitude']

us_df.head()


Out[14]:
station_code station_name suggested_name active core trend nfc_code nfc_name latitude longitude
0 US24 New York, Catskill Mnt., Rondout Creek Rondout Creek, Catskills YES YES YES 01364959 RONDOUT CREEK 41.9367 -74.3764
1 US25 W Br Neversink R At Winnisook, Catskills West Branch Neversink River at Winnisook, Cats... YES YES YES 01434021 W BR NEVERSINK R AT WINNISOOK 41.9725 -74.4483
2 US26 Biscuit Brook, Catskills Biscuit Brook, Catskills YES YES YES 01434025 BISCUIT BROOK 42.0111 -74.4147
3 US23 New York, Catskill Mnt., E. Branch Neversink, ... East Branch Neversink River, Catskills YES YES YES 0143400680 EAST BRANCH NEVERSINK, HEADWAT 41.9869 -74.5031
4 US27 Little Hope Pond, Adirondacks Little Hope Pond, Adirondacks YES YES NO, road salt contamination 020058O LITTLE HOPE POND 44.5158 -74.1252

In [37]:
# Dict to store results
res_dict = {'stn_code':[],
            'code_present':[],
            'name_ok':[],
            'lat_ok':[],
            'lon_ok':[]}

# Iterate over spreadsheet
for idx in range(len(us_df)):
    # Get stn id
    stn_cd = us_df.ix[idx, 'station_code']
    name = us_df.ix[idx, 'suggested_name'] # These are John's ideal names
    lat = us_df.ix[idx, 'latitude']
    lon = us_df.ix[idx, 'longitude']
    
    # Check stations table
    q_res = stn_df.query('station_code == @stn_cd')
    
    if len(q_res) == 0:
        # The site isn't present with this code
        res_dict['stn_code'].append(stn_cd)
        res_dict['code_present'].append(0)
        res_dict['name_ok'].append(np.nan)
        res_dict['lat_ok'].append(np.nan)
        res_dict['lon_ok'].append(np.nan)
    
    elif len(q_res) == 1:
        # Check the site properties
        res_dict['stn_code'].append(stn_cd)
        res_dict['code_present'].append(1)
        res_dict['name_ok'].append(name == q_res.station_name.values[0])
        res_dict['lat_ok'].append(lat == q_res.latitude.values[0])
        res_dict['lon_ok'].append(lon == q_res.longitude.values[0])
        
    else:
        # This should be impossible
        raise ValueError('Site code %s appears to be duplicated.' % stn_cd)

# Build df 
res_df = pd.DataFrame(res_dict)
res_df = res_df[['stn_code', 'code_present', 
                 'name_ok', 'lat_ok', 'lon_ok']]

res_df.head()


C:\Data\64_Bit_WinPython\python-2.7.10.amd64\lib\site-packages\ipykernel\__main__.py:31: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Out[37]:
stn_code code_present name_ok lat_ok lon_ok
0 US24 1 False False False
1 US25 1 False False False
2 US26 1 True False False
3 US23 1 False False False
4 US27 1 True True True

Note the Unicode warning here, highlighting that some comparisons on the name column have failed because of special text characters, so the name results are probably a worst-case scenario: some sites flagged as mismatches may actually be fine.
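
One way to make the name comparison more robust would be to normalise both strings first. A sketch (not applied above):

import unicodedata

def norm_name(s):
    """Lower-case, accent-stripped version of s for comparison."""
    if not isinstance(s, unicode):
        s = unicode(str(s), 'utf-8', 'replace')
    s = unicodedata.normalize('NFKD', s)
    return s.encode('ascii', 'ignore').strip().lower()

# e.g. the name check above would become:
# res_dict['name_ok'].append(norm_name(name) ==
#                            norm_name(q_res.station_name.values[0]))

Nevertheless, some of the sites match exactly even without this. Let's see how many matches and partial matches we have.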


In [38]:
# How many matches?
print 'Total number of records in spreadsheet:               ', len(res_df)
print 'Of these...'
print 'Number matched by code:                               ', len(res_df.query('code_present==1'))
print 'Number matched by code with correct lat and lon:      ', len(res_df.query('lat_ok==1 & lon_ok==1'))
print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))


Total number of records in spreadsheet:                110
Of these...
Number matched by code:                                95
Number matched by code with correct lat and lon:       67
Number matched by code with correct name, lat and lon: 58

The script above uses RESA2 codes taken from John's spreadsheet. For the most part these are the old codes (e.g. US26) associated with the original ICPWaters US project (rather than the more recent codes, such as X15:1434025, used in the trends project). It seems that 58 of these sites have exactly the correct information (name, lat and long), and a further 9 have the correct lat and long but do not use John's suggested names. A further 28 sites can be matched by code but have incorrect geographic co-ordinates.

In total, there are 95 matches based on site codes. A further 15 sites (the ones marked NA in John's spreadsheet) cannot be matched at all but, of these, only 4 should be included in the core ICPW dataset and only 2 in the trends analysis. These four sites have already been dealt with above (see the start of section 1.2): they are in the database with suitable codes - it's just that these codes are missing from John's workbook.

What about the sites created as duplicates for the trend analysis project? How many of those can be matched from John's spreadsheet?


In [69]:
# Dict to store results
res_dict = {'nfc_code':[],
            'code_present':[],
            'name_ok':[],
            'lat_ok':[],
            'lon_ok':[]}

# Iterate over spreadsheet
for idx in range(len(us_df)):
    # Get stn id
    nfc_cd = us_df.ix[idx, 'nfc_code']
    name = us_df.ix[idx, 'nfc_name'] # These are John's ideal names
    lat = us_df.ix[idx, 'latitude']
    lon = us_df.ix[idx, 'longitude']
    
    # Check stations table. Need to add 'X15:' and allow for variants
    q_res = stn_df[(stn_df['station_code']=='X15:%s' % nfc_cd) |
                   (stn_df['station_code']=='X15:%s' % nfc_cd[1:]) |
                   (stn_df['station_code']=='X15:%s' % nfc_cd[:-1])]
    
    if len(q_res) == 0:
        # The site isn't present with this code
        res_dict['nfc_code'].append(nfc_cd)
        res_dict['code_present'].append(0)
        res_dict['name_ok'].append(np.nan)
        res_dict['lat_ok'].append(np.nan)
        res_dict['lon_ok'].append(np.nan)
    
    elif len(q_res) == 1:
        # Check the site properties
        res_dict['nfc_code'].append(nfc_cd)
        res_dict['code_present'].append(1)
        res_dict['name_ok'].append(name == q_res.station_name.values[0])
        res_dict['lat_ok'].append(lat == q_res.latitude.values[0])
        res_dict['lon_ok'].append(lon == q_res.longitude.values[0])
        
    else:
        # This should be impossible
        raise ValueError('Site code %s appears to be duplicated.' % nfc_cd)

# Build df 
res_df = pd.DataFrame(res_dict)
res_df = res_df[['nfc_code', 'code_present', 
                 'name_ok', 'lat_ok', 'lon_ok']]

res_df.head()


Out[69]:
nfc_code code_present name_ok lat_ok lon_ok
0 01364959 1 True True True
1 01434021 1 False True True
2 01434025 1 True True True
3 0143400680 1 False True True
4 020058O 1 True True True

In [70]:
# How many matches?
print 'Total number of records in spreadsheet:               ', len(res_df)
print 'Of these...'
print 'Number matched by code:                               ', len(res_df.query('code_present==1'))
print 'Number matched by code with correct lat and lon:      ', len(res_df.query('lat_ok==1 & lon_ok==1'))
print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))


Total number of records in spreadsheet:                110
Of these...
Number matched by code:                                89
Number matched by code with correct lat and lon:       74
Number matched by code with correct name, lat and lon: 68

This approach manages to match most (but not necessarily all) of the US sites using the X15: convention, but note that I had to jump through a few hoops to get here: the code above initially tried to find matches using just the raw NFC code with X15: prepended, but this failed in many cases. Two causes (there may be others) are:

  • Leading zeros on the NFC site codes have been truncated when adding to RESA2. For example, station 01364959 is in the database as X15:1364959.

  • Trailing letter codes have been removed (or weren't included originally) upon adding to RESA2. For example, station 1C1-112E is in the database as X15:1C1-112.

Unfortunately, these kinds of discrepancies make automated cleaning of the database difficult.
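
In the meantime, the variants encountered so far can at least be collected into a small helper, so that any future matching is done consistently (a sketch - more rules will probably be needed as new cases turn up):

def x15_candidates(nfc_cd):
    """Candidate RESA2 station codes for a given NFC site code."""
    return set(['X15:%s' % nfc_cd,        # raw code
                'X15:%s' % nfc_cd[1:],    # leading zero truncated
                'X15:%s' % nfc_cd[:-1]])  # trailing letter dropped

print x15_candidates('0143400680')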

A final test for the moment is to see whether I've actually managed to find all the stations that John would like included in the trends analysis.


In [78]:
# Filter dfs
match_df = res_df.query('code_present==1')
trend_df = us_df.query('trend == "YES"')

# Join
trend_df = trend_df.merge(match_df, how='left',
                          on='nfc_code')


# Identify records with no match
trend_df[pd.isnull(trend_df['code_present'])]


Out[78]:
station_code station_name suggested_name active core trend nfc_code nfc_name latitude longitude code_present name_ok lat_ok lon_ok
65 US74 Mud Pond, Maine Mud Pond, Maine YES YES YES 1E1-134E MUD POND 44.6330 -68.0908 NaN NaN NaN NaN
74 US82 Newbert Pond, Maine Newbert Pond, Maine YES YES YES ME-9998E NEWBERT POND 44.3364 -69.2711 NaN NaN NaN NaN

So I've managed to match all the sites for the trends work, with two exceptions. The first is (the increasingly troublesome) Mud Pond, which hasn't matched because, when I tidied it up above, I changed the site code for station 37063 back to the original US74 rather than the X15: version. The second is Newbert Pond, which is in the database as part of the core ICP Waters programme, but apparently hasn't previously been part of the trends analysis.

Summary

This exploration proves that, with a bit of work, I can identify the sites John would like included in each project and clean up their properties. I haven't really begun to tackle the problem of actually integrating the data, though - that's going to be much more challenging.

It's also worth emphasising that the above was only really possible thanks to John's spreadsheet, so doing the same thing for other countries may be more difficult.