In [3]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, mpld3
import pandas as pd, os, glob, numpy as np, imp
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sqlalchemy import create_engine
sn.set_context('notebook')
My "to do" list for RESA2 following the meeting on 27/05/2016 is on the network here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\JamesS\RESA2_To_Do_HWI.docx
I want to start working through these items, prioritising stuff relating to the TOC trends analysis because I know I'm running a bit behind here. I also want to document my changes carefully as this is the first time I've modified the "live" version of RESA2, and I need to be able to undo things again in case I mess it all up. This notebook documents my workflow.
NB for James: Some of the code below modifies RESA2. It also documents what the database looked like before making any changes. Do not re-run this workbook without checking the code below first, as the results will change and you'll lose the record of what you've done.
As a first step, I'd like to try connecting Python directly to the Oracle database underlying RESA2. If this works, I should be able to interact with the database directly from my code (i.e. bypassing the RESA2 interface), which will make some of the steps below much easier.
As a quick test, start by creating a connection and running a simple query against the database.
In [2]:
# Create connection
# Use custom RESA2 function to connect to db
r2_func_path = r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template\useful_resa2_code.py'
resa2 = imp.load_source('useful_resa2_code', r2_func_path)
engine, conn = resa2.connect_to_resa2()
# Test SQL statement
sql = ('SELECT project_id, project_number, project_name '
'FROM resa2.projects')
df = pd.read_sql_query(sql, engine)
df.head(10)
Out[2]:
Woohoo - that was much easier than expected! Right, on with the data cleaning, starting with the easiest stuff first...
The definitive list of US LTM sites to be included in the analysis is attached to John's e-mail from 26/05/2016 at 16:35. After discussion with Tore, the best way to remove the unwanted sites is to create a new project called XXX_US_LTM_EXCLUDED and shift the sites marked EXCLUDE in John's spreadsheet over to that.
First, let's see how the US sites currently in the database compare to those in John's spreadsheet. Start by reading the spreadsheet.
In [3]:
# Read John's list of US sites
us_xlsx = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W\ICP_Waters'
r'\TOC_Trends_Analysis_2015\Data\US.sites.include.exclude.xlsx')
j_df = pd.read_excel(us_xlsx, sheetname='Sheet1')
print 'Total number of US sites in spreadsheet: %s.' % len(j_df)
j_df.head()
Out[3]:
Next, a quick manual check in RESA2 shows that the ICPW_TOCTRENDS_2015_US_LTM project has PROJECT_ID = 3870, so let's get the sites associated with that project from the database.
In [4]:
# Get the ICPW_TOCTRENDS_2015_US_LTM sites from RESA2
sql = ('SELECT * '
'FROM resa2.projects_stations '
'WHERE project_id = 3870')
r_df = pd.read_sql_query(sql, engine)
print 'Total number of US sites in RESA2: %s.' % len(r_df)
r_df.head()
Out[4]:
So there are 3 more sites in RESA2 than in John's spreadsheet (and that's including the sites John would like removed). It looks as though the station IDs are compatible between the two datasets, so to work out what's going on we can match on station ID and (hopefully) identify the 3 mystery sites.
In [5]:
all_df = pd.merge(r_df, j_df, how='left',
left_on='station_id', right_on='Station ID')
all_df[pd.isnull(all_df['Station ID'])]
Out[5]:
Argh! There are 5 sites in the table above, not 3. This implies that, as well as the 5 sites in RESA2 that are missing from John's spreadsheet, there must be 2 sites in John's spreadsheet that are not associated with the project in RESA2 (otherwise the totals wouldn't differ by 3). Repeating the above code using an outer join rather than a left join should make this apparent.
In [6]:
all_df = pd.merge(r_df, j_df, how='outer',
left_on='station_id', right_on='Station ID')
all_df[pd.isnull(all_df['Station ID']) | pd.isnull(all_df['station_id'])]
Out[6]:
The first 5 rows of this table are sites that appear in RESA2, but which are not in John's spreadsheet; the last two rows appear in John's spreadsheet but are not in RESA2 (or, at least, they're not associated with the ICPW_TOCTRENDS_2015_US_LTM project).
Decide to do the following:
Assume the 5 sites not mentioned in John's spreadsheet are not needed in the analysis, so exclude them from RESA2 along with the others marked for exclusion (check this with John).
Add one new site (US74; Mud Pond, Maine) to the project (or database?).
Do nothing for Willis Lake, Adirondacks, because John has it marked for exclusion anyway.
NB: There is another Mud Pond in both RESA2 and John's spreadsheet (station ID 37063; site code 1E1-134). This Mud Pond has an almost identical latitude to Mud Pond, Maine, but a significantly different longitude. I assume these really are two different sites, as specified in John's spreadsheet, but perhaps double-check this with John too. For reference, the two relevant entries from John's spreadsheet are:
Station ID | Station Code | Station name | Latitude | Longitude | INC_EXC |
---|---|---|---|---|---|
23709 | US74 | Mud Pond, Maine | 44.6306 | -65.0939 | INCLUDE |
37063 | 1E1-134 | MUD POND | 44.633 | -68.0908 | INCLUDE |
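As a quick sanity check on that longitude difference, the separation between the two sets of co-ordinates can be computed with geopy (which I use again later in this notebook):

from geopy.distance import vincenty
# Distance between the two Mud Pond entries, using the co-ordinates
# tabulated above from John's spreadsheet
d_km = vincenty((44.6306, -65.0939), (44.633, -68.0908)).kilometers
print 'Separation between the two Mud Ponds: %.0f km' % d_km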
NB2: There is a Willis Lake already in RESA2, with exactly the same geographic co-ordinates as the Willis Lake, Adirondacks highlighted in the above table. John's spreadsheet actually contains two versions of this site as well (shown in the table below) and both have valid station IDs, so I guess he must have found it duplicated somewhere in RESA2.
Station ID | Station Code | Station name | Latitude | Longitude | INC_EXC |
---|---|---|---|---|---|
23683 | US48 | Willis Lake, Adirondacks | 43.3714 | -74.2463 | EXCLUDE |
37003 | 050215O | WILLIS LAKE | 43.3714 | -74.2463 | EXCLUDE |
Regardless, both of these are marked as EXCLUDE, so as long as I make sure both are removed from the project in RESA2 everything should be OK.
Following Tore's advice, I've created a new project in RESA2 called ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED, which has project code 4150. I want to associate all the sites marked EXCLUDE in John's spreadsheet, plus the five sites listed above, with this new code. To do this, I first need a list of all the stations I want to move. In total, there should be 20 stations marked for exclusion (15 from John's spreadsheet and 5 from above), but as Willis Lake, Adirondacks (station code US48) is not currently linked to the project, I'm only expecting 19 matches from the database itself. Let's check.
In [7]:
# List of sites marked 'EXCLUDE' in John's spreadsheet
j_exc = j_df.query('INC_EXC == "EXCLUDE"')
j_exc_li = list(j_exc['Station ID'].values)
# List of sites already in RESA2 to exclude
r_exc = all_df[pd.isnull(all_df['Station ID'])]
r_exc_li = [int(i) for i in list(r_exc['station_id'].values)]
# Concatenate lists and convert to tuple
exc_tu = tuple(j_exc_li + r_exc_li)
# Extract matches from database
sql = ('SELECT * '
'FROM resa2.projects_stations '
'WHERE project_id = 3870 '
'AND station_id IN %s' % str(exc_tu))
exc_df = pd.read_sql_query(sql, engine)
print 'Number of sites to exclude from US_LTM project: %s.' % len(exc_df)
exc_df
Out[7]:
So far so good. The final step is to use an update query to change the project code for these sites from 3870 to 4150. NB: This cannot be undone, and you should only run this code once. It will also make changes to the database, so all the code above here will return different results if you run it again in the future.
In [9]:
# Move sites to EXCLUDED project
sql = ('UPDATE resa2.projects_stations '
'SET project_id=4150 '
'WHERE project_id = 3870 '
'AND station_id IN %s' % str(exc_tu))
# Only run the code below when you're sure it's correct and you're
# finished with the code above!
result = conn.execute(sql)
Having made these changes, there should now be fewer sites associated with the ICPW_TOCTRENDS_2015_US_LTM project. There were originally 90 sites in RESA2 and we've just excluded 19, so there should be 71 left. John's spreadsheet has 72 US sites on the "include" list, but this is OK because we haven't yet added Mud Pond, Maine - that's the next step.
John's spreadsheet includes a Station ID for Mud Pond, Maine (23709), which I assume he got from RESA2, so perhaps the site is actually in the database after all, but simply not associated with the ICPW_TOCTRENDS_2015_US_LTM project. Let's check.
In [36]:
# Search all sites in RESA2 for site ID 23709
sql = ('SELECT * '
'FROM resa2.stations '
'WHERE station_id = 23709')
df = pd.read_sql_query(sql, engine)
df
Out[36]:
OK, so the site is there (it looks as though Tore added it back in March 2011), it just hasn't been associated with the updated TOC trends analysis. This should be a fairly quick fix via Access. Try the following:
Open Access and make an ODBC connection to RESA2's PROJECTS_STATIONS table.
Add a new row linking station_id 23709 to project_id 3870.
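For the record, the equivalent edit could also be made directly from Python. A minimal sketch, assuming PROJECTS_STATIONS will accept a row with just these two columns:

# Sketch of the equivalent direct insert - not run, because the
# Access edit above has already added this row
sql = ('INSERT INTO resa2.projects_stations (project_id, station_id) '
       'VALUES (3870, 23709)')
# result = conn.execute(sql)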
With a bit of luck, if we now query the database for the list of stations associated with ICPW_TOCTRENDS_2015_US_LTM, we should get all 72 sites.
In [3]:
# Get the updated ICPW_TOCTRENDS_2015_US_LTM sites from RESA2
sql = ('SELECT * '
'FROM resa2.projects_stations '
'WHERE project_id = 3870')
r_df = pd.read_sql_query(sql, engine)
print 'Total number of US LTM sites in RESA2: %s.' % len(r_df)
Good. The final thing for this task is to make sure the site details in RESA2 match the information in John's spreadsheet. To check this we first need to extract the site details from RESA2.
In [38]:
# Get the site details from RESA2 for the 72 sites
sql = ('SELECT st.station_id, st.station_code, st.station_name, st.latitude, st.longitude '
'FROM resa2.projects_stations ps, resa2.stations st '
'WHERE ps.station_id=st.station_id '
'AND ps.project_id=3870')
r_df = pd.read_sql_query(sql, engine)
r_df.head()
Out[38]:
And then join them to the information in John's spreadsheet, so we can check whether the columns agree.
In [39]:
# Join John's data to RESA2 data
us_df = pd.merge(r_df, j_df, how='left',
left_on='station_id', right_on='Station ID')
# Check columns match
print 'Station IDs match: ', (us_df['station_id'] == us_df['Station ID']).all()
print 'Station codes match: ', (us_df['station_code'] == us_df['Station Code']).all()
print 'Station names match: ', (us_df['station_name'] == us_df['Station name']).all()
print 'Station latitudes match: ', np.allclose(us_df['latitude'], us_df['Latitude']) # Within tolerance of 1E-8
print 'Station longitudes match:', np.allclose(us_df['longitude'], us_df['Longitude']) # Within tolerance of 1E-8
The RESA2 project ICPW_TOCTRENDS_2015_US_LTM now contains only the 72 sites marked INCLUDE in John's spreadsheet.
The station IDs, codes, names and geographic co-ordinates for these 72 sites in RESA2 are identical to those in John's spreadsheet. I haven't checked any of the other site properties yet, though - that's another task.
The next main task is to make sure the site properties for all stations are as complete as possible. This is likely to be fiddly.
In particular, we're interested in having the following details for each site:
Geographic location, specified as latitude and longitude using the WGS84 GCS (EPSG 4326).
Mean catchment elevation (not just sampling point elevation).
Land cover proportions, using the original land cover classes where possible:
NB: If separate coniferous and deciduous classes are not available, an aggregated forestry class is acceptable.
NB2: Wetlands and peatlands are not the same. Keep them separate for now, although they’ll probably be treated together in the final analysis. This means countries should report either wetlands or peatlands – not both.
NB3: Land use proportions should sum to 100% and there should be no blanks/data gaps. This may require the addition of an “Other” category to allow for the fact that not all land classes are covered by the classification scheme above. Where possible, I’ll try to assign values for any new categories explicitly (i.e. based on information from the focal centres) rather than implicitly (i.e. as 100% minus the sum of everything else).
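As a sketch of the implicit fallback calculation (the explicit route is preferred wherever the focal centres can supply values), assuming a dataframe lu_df with hypothetical land cover class columns:

# Sketch only: 'lu_df' and the class columns are placeholders for
# whatever ends up in the tidied land cover table
lc_cols = ['Coniferous', 'Deciduous', 'Grassland', 'Wetland']
lu_df['Other'] = 100. - lu_df[lc_cols].sum(axis=1)
# Negative values mean the reported classes already sum to > 100%,
# which needs querying with the focal centre rather than clipping
print lu_df[lu_df['Other'] < 0]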
I suspect filling in the missing information will require a bit of to-ing and fro-ing with the focal centres, which unfortunately could take a while.
The first step is to see what information we already have. The easiest way to do this is to extract all the relevant site details from RESA2 to Excel. The following RESA2 projects look like they should be relevant:
Taken together, these projects are associated with 704 sites. This makes sense, because when I previously extracted data for matching sites between the 2006 and 2015 analyses (see notebook here), I found 722 sites for the 2015 analysis, including 90 sites for the US LTM project. Having now cleaned up the US sites (above), we have 72 sites in this project, and $722 - (90 - 72) = 704$.
However, before going any further, I'd like to agree on a definitive list of RESA2 projects for inclusion in the subsequent analysis. After an e-mail discussion with Heleen (e.g. 16/06/2016 at 16:53), we've decided to include the following 13 projects:
(The last of these is a project I've not looked at before, but hopefully it won't cause any new problems).
In total, there are 605 sites associated with these projects, and I've exported all their relevant properties to an Excel file called sites_2015_tidied.xlsx. Let's read the spreadsheet and see what's missing.
In [28]:
stn_xlsx = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W'
r'\ICP_Waters\TOC_Trends_Analysis_2015\Data\sites_2015_tidied.xlsx')
stn_df = pd.read_excel(stn_xlsx, sheetname='DATA')
# Replace spaces in header with underscores
stn_df.columns = [x.replace(' ', '_') for x in stn_df.columns]
stn_df.head()
Out[28]:
In [14]:
# For details of the 'query NaN' hack used below, see:
# http://stackoverflow.com/questions/26535563/querying-for-nan-and-other-names-in-pandas
# Stns with missing co-ords
stn_df.query('(Latitude != Latitude) or (Longitude != Longitude)')
Out[14]:
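(For reference, the same selection can be written more directly using isnull, without the NaN inequality hack:)

# Equivalent selection without the 'query NaN' trick
stn_df[stn_df['Latitude'].isnull() | stn_df['Longitude'].isnull()]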
Slightly ironically, Langtjern is one of the sites with missing co-ordinates! In reality, RESA2 does have location information for Langtjern, but it's specified using UTM Zone 33N (EPSG 32633) co-ordinates: (209388, 6704530). Converting these to WGS84 is easy enough, and is done by the code below.
Metadata for the UK sites can be found here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\data delivery\UK\copy of metadata UK from Don.xlsx
However, this file doesn't include details for Loch Coire Fionnaraich. Searching online, it seems that the loch is part of the UK Upland Water Monitoring Network, which states the site co-ordinates as latitude 57.4917, longitude -5.4306. Some information regarding land cover and altitude for this site can also be found at the UK Lakes Portal, which could be useful later. Contact Don to see whether this site should be included and, if so, check these data are correct.
In [11]:
# Project Langtjern co-ordinates to WGS84 lat/long
from pyproj import Proj, transform
# Initialise spatial refs.
nor_proj = Proj(init='epsg:32633')
wgs_proj = Proj(init='epsg:4326')
# Transform
lon, lat = transform(nor_proj, wgs_proj, 209388, 6704530)
print 'Langtjern lat:', lat
print 'Langtjern lon:', lon
I've used an Access ODBC connection to update the co-ordinates for these two sites in RESA2.
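The same edit could be issued through the existing database connection instead. A sketch, where stn_id, lat and lon are placeholders to fill in with the real values:

# Sketch of the equivalent direct update - not run, as the Access
# edit has already been made. stn_id, lat and lon are placeholders
sql = ('UPDATE resa2.stations '
       'SET latitude = %s, longitude = %s '
       'WHERE station_id = %s' % (lat, lon, stn_id))
# result = conn.execute(sql)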
The next step is to identify sites without mean catchment elevation data. There are two columns in the database recording this kind of information: ECCO_ALTITUDE_AVERAGE and Catchment_mean_eleveation (sic), but it looks as though quite a few sites have blanks in both these columns. Let's group the sites with missing data by country and count them.
In [24]:
# Stns with missing mean altitude
df = stn_df.query('(Catchment_mean_eleveation != Catchment_mean_eleveation) and '
'(ECCO_ALTITUDE_AVERAGE != ECCO_ALTITUDE_AVERAGE)')
df = df[['Station_ID', 'Station_Code', 'Station_name', 'Country']]
grpd = df.groupby(['Country'])
mis_elev = grpd.count()[['Station_ID']]
mis_elev.columns = ['Number of sites missing elevation data',]
print 'Total number of sites without average elevation data:', len(df)
mis_elev
Out[24]:
(Aside: Note that one of the US sites appears to be labelled "US", whereas the others are all labelled "United States". I've corrected this in RESA2 via Access ODBC, so it shouldn't show up again).
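(A sketch of the equivalent direct fix is below; note I haven't verified that the country label actually lives in a COUNTRY column on resa2.stations, so treat the column name as an assumption:)

# Hypothetical sketch: the COUNTRY column name is an assumption
# about the RESA2 schema - check before running anything like this
sql = ("UPDATE resa2.stations "
       "SET country = 'United States' "
       "WHERE country = 'US'")
# result = conn.execute(sql)  # already fixed via Access ODBC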
Of the 605 sites, 131 have missing entries in both the Catchment_mean_eleveation (sic) and ECCO_ALTITUDE_AVERAGE columns. Consider each country in turn.
The single UK site is Loch Coire Fionnaraich. As noted above, this does not appear in the spreadsheet of UK site metadata, so check with Don to see whether this site should be included and, if so, ask for the mean catchment elevation.
I assume NIVA will be able to supply catchment properties for the 7 Norwegian sites pretty easily. The sites in question are listed below. Contact Heleen to see whether she knows of elevation data (or catchment boundaries) for these locations (e-mail sent 20/06/2016 at 17:09).
In [26]:
# Norwegian stns with missing mean altitude
df = stn_df.query('(Catchment_mean_eleveation != Catchment_mean_eleveation) and '
'(ECCO_ALTITUDE_AVERAGE != ECCO_ALTITUDE_AVERAGE) and '
'(Country == "Norway")')
df = df[['Station_ID', 'Station_Code', 'Station_name', 'Country']]
df
Out[26]:
The two Swedish sites without elevation data are shown below. Searching the web, I can find data download pages for Svinarydsjön and Gosjön, but neither of these seems to have catchment characteristics such as mean elevation. Salar is probably a good person to ask about this initially - e-mail sent 20/06/2016 at 17:09.
In [27]:
# Swedish stns with missing mean altitude
df = stn_df.query('(Catchment_mean_eleveation != Catchment_mean_eleveation) and '
'(ECCO_ALTITUDE_AVERAGE != ECCO_ALTITUDE_AVERAGE) and '
'(Country == "Sweden")')
df = df[['Station_ID', 'Station_Code', 'Station_name', 'Country']]
df
Out[27]:
In [30]:
# US stns with missing mean altitude
df = stn_df.query('(Catchment_mean_eleveation != Catchment_mean_eleveation) and '
'(ECCO_ALTITUDE_AVERAGE != ECCO_ALTITUDE_AVERAGE) and '
'(Country == "United States")')
df = df[['Station_ID', 'Station_Code', 'Station_name', 'Country', 'Latitude', 'Longitude']]
df
Out[30]:
In his e-mail sent 23/05/2016 at 16:24, John sent me a file containing the mean elevation and land use proportions for 86 US sites. Subsequently, some of these sites have been excluded (see John's e-mail from 26/05/2016 at 16:35), but the original file nevertheless includes land use proportions and mean elevations for the first four sites in the table above. However, although the station codes appear to match, some of the geographic co-ordinates are slightly different. The table below shows the co-ordinates in John's LTM_WSHED_LANDCOVER.csv file, which can be compared to the values in the table above (which are from RESA2, and which also match the figures in John's US.sites.include.exclude.xlsx spreadsheet).
SITE_ID | LATDD | LONDD |
---|---|---|
1364959 | 41.9367 | -74.3764 |
1434025 | 41.9869 | -74.5031 |
1434021 | 42.0111 | -74.4147 |
143400680 | 41.9725 | -74.4483 |
The differences in co-ordinates are small (< 0.1 degrees of longitude, which is roughly 8 km at these latitudes), and they are probably due to people using slightly different co-ordinate transformations from the original (projected?) spatial reference system into WGS84. Nevertheless, check with John before using this spreadsheet to update the elevation and land use information for the US LTM sites. I also need to ask John for elevation and land use statistics for Mud Pond, Maine (US74). This station code doesn't appear in LTM_WSHED_LANDCOVER.csv (or, if it does, its co-ordinates and station code are specified differently), so at present I don't have any land use or elevation data for this site.
The majority of the sites with missing data are Canadian (116 sites in total), and the metadata files we currently have for Canada are on the network here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\data delivery\CA
None of these files include catchment mean elevations, though (and most are missing land use information too - see below). Most of them do have the elevation of the sampling location, but in our meeting back in May Don didn't think this would be sufficient for the analysis.
Strangely, there are only 109 Canadian sites in the raw data folder on the network, so where have the other 7 come from?
Country | Region code | No. of sites | Contact |
---|---|---|---|
Canada | Dorset | 8 | Andrew Paterson |
Canada | QC | 37 | Suzanne Couture |
Canada | Atl | 52 | Suzanne Couture |
Canada | NF | 12 | Suzanne Couture |
Total | | 109 | |
Sorting this out is likely to be fiddly. As a start, I've created a new Excel file (canadian_raw_vs_resa2.xlsx) and copied the raw data for the 109 sites into one sheet and the RESA2 data for the 116 sites into another. Unfortunately, the site codes, names and even co-ordinates are significantly different between the two sheets (!), so there's no robust way of matching them automatically. Linking by name using a simple VLOOKUP function in Excel returns 93 matches; doing the same using the site code (after stripping off the X15: prefix) gets me another 5. The rest will need to be done manually. During this process I've discovered that MOUNT TOM LAKE appears twice in the raw data (in the file Site characteristics ICP Waters trend analysis_Atl_final.xlsx), so there are actually only 108 Canadian sites in the folder above.
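For reference, the code-matching half of this could equally be done in pandas rather than Excel. A minimal sketch, where the dataframe and column names are illustrative only:

# Sketch: strip the 'X15:' prefix from the RESA2 station codes so
# they can be matched against the raw-data codes
resa2_ca['Strip_X15_Code'] = resa2_ca['station_code'].str.replace('X15:', '')
matched = pd.merge(resa2_ca, raw_ca, how='inner',
                   left_on='Strip_X15_Code', right_on='Site_Code')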
The overall conclusion here is that I need to contact Suzanne and Andrew to see if they have mean catchment elevation data for all their sites. In addition, there are another 8 Canadian sites $(= 116 - 108)$ where I don't know who supplied the original data. These sites are listed below. E-mail Heleen (20/06/2016 at 17:09) to ask if she knows where these data came from originally.
In [32]:
can_xlsx = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W'
r'\ICP_Waters\TOC_Trends_Analysis_2015\Data\canadian_raw_vs_resa2.xlsx')
can_df = pd.read_excel(can_xlsx, sheetname='RESA2_CA')
can_df = can_df[['Station ID', 'Strip_X15_Code', 'Station name',
'Latitude', 'Longitude', 'In_Raw']]
can_df.query('In_Raw == 0')
Out[32]:
It seems as though different countries have reported their land use in different ways, so I suspect this is going to be complicated. The best approach is probably going to be to consider each country in turn.
There is a small amount of land use information present for the Canadian sites, but it's very patchy and the values don't add up to anything like 100%. As I need to ask for elevation information for all of these sites anyway, it's probably a good idea to ask for land use at the same time.
All the land use proportions add to 100%. Woohoo!
Finland seems to have provided fairly comprehensive land use information, but the values rarely add up to 100% (108% in one case!). We can either query this with the Finns, or just assume the remaining land belongs to an "Other" category (not sure what to do with the 108% site though). Ask Heleen what she'd like to do.
Like Finland, the Norwegian data seems rather patchy and rarely sums to 100%. Ask Heleen where these values came from originally and whether she'd like to refine/improve them.
The Polish data is a mystery. Total land cover percentages range from 113% to 161%, and the values for Grassland are identical to those for Wetland. Even allowing for this duplication, the numbers still don't make much sense. I also can't find the raw data on the network anywhere - see if Heleen knows where the values come from?
The situation for Slovenia is very similar to Poland, with total land cover percentages well over 100%. Also, just like Poland, the Grassland and Wetland percentages are identical, which is obviously an error. This strongly suggests some kind of problem in the data upload procedure, but the issue isn't present for all countries. I'm not sure what's going on here yet, but this is definitely something to watch out for.
Actually, I'm a bit surprised Slovenia is in here at all - it's not one of the countries listed in project codes at the start of section 3. Nor is Poland for that matter! After a quick check in RESA2, it seems that Poland and Slovenia are both grouped under the Czech Republic project. Hmm.
As with Poland, I'm struggling to find the original metadata for the Slovenian sites. Ask Heleen to see if she can point me in the right direction.
The Swedish data look fairly complete, although there's no distinction made between deciduous and coniferous forestry (there's just an aggregated Total forest class). Most of the land use proportions sum to somewhere close to 100%, but there are about 45 sites where the total is less than 90% and 19 sites where it's less than 50%. Some of these probably need following up. Ask Salar if he knows where to get land use proportions for Swedish sites.
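(The kind of completeness check I'm using here looks roughly like this; 'lu_cols' stands in for the real class columns in the site table:)

# Sketch of the completeness check: flag sites whose land use
# proportions fall well short of 100%
lu_cols = ['Total_forest', 'Grassland', 'Wetland', 'Water', 'Other']
totals = stn_df[lu_cols].sum(axis=1)
print 'Sites with totals < 90%:', (totals < 90).sum()
print 'Sites with totals < 50%:', (totals < 50).sum()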
Land use proportions for all but two of the UK sites sum to exactly 100%. It's late on a Friday afternoon, and this simple result makes me feel so grateful I could almost weep. It must be nearly home time!
The two sites that need correcting are:
Loch Coire Fionnaraich (station code UK_26), which has no land use (or elevation) data at all yet (but information is available from the UK Lakes Portal), and
Llyn Llagi (station code UK_15), where the summed proportions equal 105%. This error seems to stem from the raw metadata for the UK sites, which can be found here:
K:\Prosjekter\langtransporterte forurensninger\O-23300 - ICP-WATERS - HWI\Database\2015 DOC analysis\data delivery\UK\copy of metadata UK from Don.xlsx
Ask Don if the data from the UK Lakes Portal is suitable and also whether he can supply corrected proportions for Llyn Llagi.
The land use proportions for the US sites are almost complete and they sum to (very close to) 100%, which is great. There are two things to check:
The sum of the Deciduous and Coniferous forestry classes is often significantly less than that for the Total forest class. The overall land use proportions only add to 100% if the Total forest class is used, rather than considering the deciduous and coniferous classes separately.
Land use proportions are missing for 5 US sites. These are the same sites identified above as having missing elevation data, but for completeness they are shown again in the table below.
In [33]:
# US stns with missing mean altitude
df = stn_df.query('(Catchment_mean_eleveation != Catchment_mean_eleveation) and '
'(ECCO_ALTITUDE_AVERAGE != ECCO_ALTITUDE_AVERAGE) and '
'(Country == "United States")')
df = df[['Station_ID', 'Station_Code', 'Station_name', 'Country', 'Latitude', 'Longitude']]
df
Out[33]:
John's file LTM_WSHED_LANDCOVER.csv contains proportions for the first 4 of these, but I'm still lacking land cover data for Mud Pond, Maine (station code US74). Ask John for this data and also double check forestry proportions.
During the meeting on 27/05/2016, Don and John thought it would be helpful to know the distance from each site to the nearest piece of coastline (in any direction). In Euclidean co-ordinates this is a simple calculation, but at intercontinental scale using WGS84 we need to perform geodesic calculations on the ellipsoid, which are more complicated. The most recent versions of ArcGIS introduced a method='GEODESIC' option for exactly this situation, but this software isn't yet available at NIVA. Major spatial database platforms such as Oracle Spatial and PostGIS also include this functionality, but the NIVA Oracle instance isn't spatially enabled and setting up a PostGIS server just for this calculation seems a bit over the top. I can think of two ways of doing this:
Code my own solution using the Vincenty formula. This is not too difficult in principle, but optimising it so that it runs efficiently might prove difficult.
Use SpatiaLite, which is the spatially enabled version of SQLite. This should be much simpler to setup than PostGIS, but last time I used SpatiaLite for geodesic calculations (several years ago) I had some reservations about the accuracy of the output. These issues have likely been fixed in the meantime, but it would be good to check.
I think the best way is to perform the calculation using SpatiaLite and then check the results are sensible using my own code as well. I can also compare my output to John's results for the 2006 analysis, although I think John may have done something a bit more sophisticated, such as measuring distances in a particular direction depending on the prevailing weather.
Start off by downloading and installing SpatiaLite and the SpatiaLite GUI from here. Then obtain a medium resolution (1:50 million scale) global coastline dataset in WGS84 co-ordinates from Natural Earth and, using ArcGIS, merge each of the separate coastline features into a single MULTILINESTRING object. Next, create a point shapefile (using the WGS84 spatial reference system) from the latitudes and longitudes in RESA2. Import both these shapefiles into a new SpatiaLite database called dist_to_coast.sqlite.
The SpatiaLite SQL code below calculates the distance from each station to the nearest point on the coast.
CREATE TABLE distances AS
SELECT
s.Station_ID AS station_id,
s.Station_Co AS station_code,
s.Station_na AS station_name,
s.Country AS country,
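-- the third argument (1) requests geodesic distance on the ellipsoid, in metres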
ST_DISTANCE(c.Geometry, s.Geometry, 1) AS distance_m
FROM ne_50m_coastline AS c, toc_2015_sites AS s;
This table has been exported as distance_to_coast.csv.
In [2]:
# Read distances to coastline table
dist_csv = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W'
r'\ICP_Waters\TOC_Trends_Analysis_2015\Data\distance_to_coast.csv')
dist_df = pd.read_csv(dist_csv)
dist_df.head()
Out[2]:
To check the results above, I've converted the Natural Earth coastline into a set of points spaced at 0.05 degree intervals. I've then deleted all the points that are very obviously far away from our stations of interest (e.g. Antarctica) and used the Add XY Co-ordinates tool to calculate the latitude and longitude for each of these points. This leaves me with 21,718 coastline points in total. A very naive and inefficient way to estimate the nearest point on the coast is therefore to calculate the distances from each station to all 21,718 coastal points, and then choose the smallest value in each case. This is incredibly badly optimised and requires $605 \times 21,718 = 13,139,390$ calculations, but I can let it run while I do other things and it should provide a useful check on the (hopefully more accurate) results from SpatiaLite.
Begin by reading the site co-ordinates and coastal point co-ordinates into two dataframes, and then loop over all combinations.
In [4]:
from geopy.distance import vincenty
# File paths
coast_path = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W'
r'\ICP_Waters\TOC_Trends_Analysis_2015\Data\coast_points.csv')
site_path = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W'
r'\ICP_Waters\TOC_Trends_Analysis_2015\Data\sites_2015_locs.csv')
# Read co-ords
coast_df = pd.read_csv(coast_path)
coast_df = coast_df[['POINT_Y', 'POINT_X']]
coast_df.columns = ['lat', 'lon']
site_df = pd.read_csv(site_path, delimiter=';')
# List of min distances
min_dists = []
# Loop over sites
for site in site_df.itertuples():
s_lat, s_lon = site.Latitude, site.Longitude
# Pre-allocate array for distances
dists = np.zeros(shape=(len(coast_df)))
# Loop over coast
for cst_pt in coast_df.itertuples():
# Co-ords on coast
c_lat, c_lon = cst_pt.lat, cst_pt.lon
# Index for array
idx = cst_pt.Index
# Insert into array
dists[idx] = vincenty((s_lat, s_lon),
(c_lat, c_lon)).meters
# Append min distance
min_dists.append(dists.min())
# Add distance column to site_df
site_df['distance_m'] = np.array(min_dists)
site_df.head()
Out[4]:
In [28]:
# Compare Vincenty and SpatiaLite distance estimates
# SpatiaLite data
df1 = dist_df[['station_id', 'distance_m']]
df1.columns =['stn_id', 'spatialite']
# Vincenty data
df2 = site_df[['Station ID', 'distance_m']]
df2.columns =['stn_id', 'vincenty']
# Combine
df = pd.merge(df1, df2, how='outer',
left_on='stn_id', right_on='stn_id')
# Convert to km
df['vincenty'] = df['vincenty'] / 1000.
df['spatialite'] = df['spatialite'] / 1000.
# Plot
fig, ax = plt.subplots()
# Get labels for points
labs = list(df['stn_id'].values)
labs = [str(i) for i in labs]
# Scatter plot
scatter = ax.scatter(df['vincenty'].values, df['spatialite'].values)
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labs)
mpld3.plugins.connect(fig, tooltip)
ax.set_title("Distance to coast")
ax.set_xlabel("Vincenty distance (km)")
ax.set_ylabel("SpatiaLite distance (km)")
mpld3.display()
Out[28]:
The plot above is "interactive": hovering the mouse over the plot area should display pan and zoom tools near the plot's bottom left-hand corner, and hovering over any point will display the station ID for that location. These tools can be used to explore the plot. Most of the distance estimates are essentially identical, but a few are significantly different. Let's plot those where the difference is large (> 75 km).
In [12]:
from mpl_toolkits.basemap import Basemap
# Get sites where distance to coast estimates differ by > 75 km
df['diff'] = np.absolute(df['spatialite'] - df['vincenty'])
big_dif = df.query('diff > 75')
big_dif = pd.merge(big_dif, site_df, how='left',
left_on='stn_id', right_on='Station ID')
fig = plt.figure(figsize=(10, 10))
# Use a basic Mercator projection
m = Basemap(projection='merc',
llcrnrlat=50,
urcrnrlat=70,
llcrnrlon=0,
urcrnrlon=40,
lat_ts=20,
resolution='i')
# Add map components
m.fillcontinents(color='darkkhaki', lake_color='darkkhaki')
m.drawcountries(linewidth=1)
m.drawmapboundary(fill_color='paleturquoise')
# Map (long, lat) to (x, y) for plotting
x, y = m(big_dif['Longitude'].values, big_dif['Latitude'].values)
# Plot
plt.plot(x, y, 'or', markersize=5)
Out[12]:
The sites with very large errors are mostly located well inland, approximately equidistant from two possible coastlines. The methods used to calculate these distances are numerical (i.e. iterative), so I suspect the differences I'm seeing are associated with convergence of the algorithms, combined with my coarse discretisation of the coastline in the case of Vincenty.
For now, I'll use the estimates from SpatiaLite, as these should be more accurate.
In his e-mail from 27/05/2016 at 14:42, John sent me the "distance to coast" dataset he created for the 2006 analysis. It's not easy to match sites between 2006 and 2015, but for those locations with a common site code we can easily compare the three estimates.
In [30]:
# Read John's distance data
j_dist_path = (r'\\niva-of5\osl-userdata$\JES\Documents\James_Work\Staff\Heleen_d_W'
r'\ICP_Waters\TOC_Trends_Analysis_2015\Data\Distance.from.coast.data.from.2006.xlsx')
j_df = pd.read_excel(j_dist_path, sheetname='Sheet1')
# Join to the other estimates
df2 = pd.merge(df, j_df, how='inner',
left_on='stn_id', right_on='SITE_NO')
df2 = df2[['spatialite', 'vincenty', 'DIST_COAST']]
df2.columns = ['spatialite', 'vincenty', 'john']
df2.dropna(how='any', inplace=True)
# Scatterplot matrix
sn.pairplot(df2, kind='reg', diag_kind='kde')
Out[30]:
Unsurprisingly, the distances aren't identical, but they do at least show a clear positive relationship. In general, John's estimates are larger than those from SpatiaLite or Vincenty, which probably reflects the fact that John calculated distances in a particular direction, rather than just going for the closest point on the coast as I've done here.
Overall, I think these distance estimates seem reasonable, so the final step is to add a new site parameter to RESA2 and upload the SpatiaLite values calculated above to the database.
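A sketch of what that upload might look like is below. NB: the table name and parameter ID are hypothetical placeholders - the real RESA2 structure for site parameters needs checking with Tore first.

# Hypothetical sketch only: 'resa2.stations_par_values' and the
# parameter ID (999) are placeholders for the real RESA2 tables
for row in dist_df.itertuples():
    sql = ('INSERT INTO resa2.stations_par_values '
           '(station_id, var_id, value) '
           'VALUES (%s, %s, %s)' % (row.station_id, 999, row.distance_m))
    # conn.execute(sql)  # not run until the schema is confirmed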