IS 608 HW 4

Using the dataset available here, complete the following:

  1. Create lists & graphs of the best and worst places to swim in the dataset.
  2. The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.
  3. Is there a relationship between the amount of rain and water quality? Show this relationship graphically. If you can, estimate the effect of rain on quality at different sites and create a visualization to compare them.

Background: The United States Environmental Protection Agency (EPA) reports Entero counts as colonies (or cells) per 100 ml of water. The federal standard for unacceptable water quality is a single sample value of greater than 110 Enterococcus/100mL, or five or more samples with a geometric mean (a weighted average) greater than 30 Enterococcus/100mL.
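As a sanity check on those thresholds, the geometric-mean criterion can be computed directly as exp of the mean log. A minimal sketch with made-up sample values:

```python
import numpy as np
from scipy.stats.mstats import gmean

# Five hypothetical Entero samples (colonies per 100 mL) -- invented values
samples = np.array([12, 45, 8, 110, 33])

# Geometric mean: the n-th root of the product, equivalently exp(mean(log(x)))
gm = gmean(samples)
manual = np.exp(np.log(samples).mean())
print(round(gm, 2))

# Apply the EPA thresholds from the background note
single_sample_fail = (samples > 110).any()
geo_mean_fail = len(samples) >= 5 and gm > 30
print(single_sample_fail, geo_mean_fail)
```

Note how the geometric mean (~27) sits well below the arithmetic mean (~42) here; it damps the influence of single large readings, which is why the EPA uses it for the multi-sample criterion.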


In [190]:
# Import modules for analysis and visualization
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats.mstats import gmean
import numpy as np
from numpy import genfromtxt
import os.path
from datetime import datetime

In [198]:
# Clean up and import (pandas)
data_url = "https://raw.githubusercontent.com/jlaurito/CUNY_IS608/master/lecture4/data/riverkeeper_data_2013.csv"
raw_data = pd.read_csv(data_url)
# info like what you get with str() in R
raw_data.dtypes


Out[198]:
Site                 object
Date                 object
EnteroCount          object
FourDayRainTotal    float64
SampleCount           int64
dtype: object

In [199]:
# look at the first 25 rows
# (the CSV's header row supplies the column names)
raw_data.head(25)


Out[199]:
Site Date EnteroCount FourDayRainTotal SampleCount
0 Hudson above Mohawk River 10/16/2011 1733 1.5 35
1 Hudson above Mohawk River 10/21/2013 4 0.2 35
2 Hudson above Mohawk River 9/21/2013 20 0.0 35
3 Hudson above Mohawk River 8/19/2013 6 0.0 35
4 Hudson above Mohawk River 7/21/2013 31 0.0 35
5 Hudson above Mohawk River 6/4/2013 238 1.2 35
6 Hudson above Mohawk River 10/15/2012 23 1.4 35
7 Hudson above Mohawk River 9/15/2012 11 0.1 35
8 Hudson above Mohawk River 8/18/2012 15 0.3 35
9 Hudson above Mohawk River 7/21/2012 6 0.2 35
10 Hudson above Mohawk River 6/16/2012 10 0.2 35
11 Hudson above Mohawk River 5/20/2012 11 0.0 35
12 Hudson above Mohawk River 6/24/2013 30 1.4 35
13 Hudson above Mohawk River 9/19/2011 11 0.1 35
14 Hudson above Mohawk River 8/21/2011 231 0.4 35
15 Hudson above Mohawk River 7/14/2011 11 0.3 35
16 Hudson above Mohawk River 7/2/2011 11 2.1 35
17 Hudson above Mohawk River 5/19/2011 91 1.6 35
18 Hudson above Mohawk River 10/16/2010 >2420 1.3 35
19 Hudson above Mohawk River 9/14/2010 15 0.0 35
20 Hudson above Mohawk River 8/21/2010 25 0.2 35
21 Hudson above Mohawk River 7/19/2010 31 0.7 35
22 Hudson above Mohawk River 5/22/2008 12 0.2 35
23 Hudson above Mohawk River 5/25/2010 <1 0.0 35
24 Hudson above Mohawk River 10/23/2009 4 0.0 35

In [200]:
# Adjust the data types to make the data easier to work with
raw_data["Site"] = raw_data["Site"].astype('category') # like facets
raw_data["Date"] = pd.to_datetime(raw_data["Date"]) # dates as dates
# We need to remove the greater than and less than symbols and treat "EnteroCount" as an integer -- 'int64'
raw_data["EnteroCount"] = raw_data["EnteroCount"].str.lstrip('><')
raw_data["EnteroCount"] = raw_data["EnteroCount"].astype('int64')
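One caveat worth flagging: stripping the symbols treats censored readings like ">2420" and "<1" as if they were exact values, which understates or overstates the true counts. A toy version of the same transformation (values made up):

```python
import pandas as pd

# Toy column with the same censored markers seen in the raw data
counts = pd.Series(['>2420', '15', '<1', '110'])

# Same cleanup as above: drop leading '<'/'>' and cast to integers.
# '>2420' becomes exactly 2420 and '<1' becomes exactly 1 -- a simplification.
cleaned = counts.str.lstrip('><').astype('int64')
print(cleaned.tolist())  # [2420, 15, 1, 110]
```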

In [201]:
# Check to make sure the changes are correct
raw_data.dtypes


Out[201]:
Site                      category
Date                datetime64[ns]
EnteroCount                  int64
FourDayRainTotal           float64
SampleCount                  int64
dtype: object

In [202]:
raw_data.head(20) # there are no more < or > symbols


Out[202]:
Site Date EnteroCount FourDayRainTotal SampleCount
0 Hudson above Mohawk River 2011-10-16 1733 1.5 35
1 Hudson above Mohawk River 2013-10-21 4 0.2 35
2 Hudson above Mohawk River 2013-09-21 20 0.0 35
3 Hudson above Mohawk River 2013-08-19 6 0.0 35
4 Hudson above Mohawk River 2013-07-21 31 0.0 35
5 Hudson above Mohawk River 2013-06-04 238 1.2 35
6 Hudson above Mohawk River 2012-10-15 23 1.4 35
7 Hudson above Mohawk River 2012-09-15 11 0.1 35
8 Hudson above Mohawk River 2012-08-18 15 0.3 35
9 Hudson above Mohawk River 2012-07-21 6 0.2 35
10 Hudson above Mohawk River 2012-06-16 10 0.2 35
11 Hudson above Mohawk River 2012-05-20 11 0.0 35
12 Hudson above Mohawk River 2013-06-24 30 1.4 35
13 Hudson above Mohawk River 2011-09-19 11 0.1 35
14 Hudson above Mohawk River 2011-08-21 231 0.4 35
15 Hudson above Mohawk River 2011-07-14 11 0.3 35
16 Hudson above Mohawk River 2011-07-02 11 2.1 35
17 Hudson above Mohawk River 2011-05-19 91 1.6 35
18 Hudson above Mohawk River 2010-10-16 2420 1.3 35
19 Hudson above Mohawk River 2010-09-14 15 0.0 35

Create lists & graphs of the best and worst places to swim in the dataset.


In [7]:
# Create a column showing whether or not the water quality is acceptable in a given place
# unacceptable:
# a single sample value > 110 Enterococcus/100mL, OR
# five or more samples with a geometric mean (a weighted average) > 30 Enterococcus/100mL

# The prompt is vague about the geometric-mean condition, so I used scipy's gmean()
# on each site's five highest samples.

# Flagging a site when any single sample exceeds 110, or when the geometric mean of its
# five highest samples exceeds 30, marks nearly every site as having unacceptable
# water quality at some point

raw_data['swim'] = np.where(
    (raw_data.groupby('Site').EnteroCount.transform('max') > 110)
    | (raw_data.groupby('Site').EnteroCount.transform(lambda group: gmean(group.nlargest(5))) > 30),
    'unacceptable', 'acceptable')
raw_data.head(10)


Out[7]:
Site Date EnteroCount FourDayRainTotal SampleCount swim
0 Hudson above Mohawk River 2011-10-16 1733 1.5 35 unacceptable
1 Hudson above Mohawk River 2013-10-21 4 0.2 35 unacceptable
2 Hudson above Mohawk River 2013-09-21 20 0.0 35 unacceptable
3 Hudson above Mohawk River 2013-08-19 6 0.0 35 unacceptable
4 Hudson above Mohawk River 2013-07-21 31 0.0 35 unacceptable
5 Hudson above Mohawk River 2013-06-04 238 1.2 35 unacceptable
6 Hudson above Mohawk River 2012-10-15 23 1.4 35 unacceptable
7 Hudson above Mohawk River 2012-09-15 11 0.1 35 unacceptable
8 Hudson above Mohawk River 2012-08-18 15 0.3 35 unacceptable
9 Hudson above Mohawk River 2012-07-21 6 0.2 35 unacceptable

In [18]:
# Because the criteria above flag nearly every site, I instead grouped the data by site and
# calculated the mean EnteroCount for each site

mean_entero = raw_data.groupby(['Site'])['EnteroCount'].mean().sort_values()

In [19]:
# Top ten best places to swim
print("Ten best places to swim")
print("+++++++++++++++++++++++")
mean_entero.head(10)


Ten best places to swim
+++++++++++++++++++++++
Out[19]:
Site
Poughkeepsie Drinking Water Intake     8.342105
Croton Point Beach                    15.458333
Stony Point mid-channel               17.340909
Little Stony Point                    17.526316
Poughkeepsie Launch Ramp              17.675676
Haverstraw Bay mid-channel            18.708333
TZ Bridge mid-channel                 21.438596
Cold Spring Harbor                    22.542857
Yonkers mid-channel                   25.019231
Irvington Beach                       28.805556
Name: EnteroCount, dtype: float64

In [20]:
# Ten worst places to swim
print("Ten worst places to swim")
print("++++++++++++++++++++++++")
mean_entero.tail(10).sort_values(ascending=False)


Ten worst places to swim
++++++++++++++++++++++++
Out[20]:
Site
Gowanus Canal                              4206.837838
Newtown Creek- Metropolitan Ave. Bridge    2953.684211
Tarrytown Marina                           2205.666667
Saw Mill River                             1455.760000
Upper Sparkill Creek                       1296.072727
Newtown Creek- Dutch Kills                 1205.087719
Kingsland Pt. Park- Pocantico River         907.857143
Orangetown STP Outfall                      854.192982
Mohawk River at Waterford                   621.057143
Piermont Pier                               482.165775
Name: EnteroCount, dtype: float64

In [21]:
# write the results to a csv and read in for more flexible use (now and later)
mean_entero.to_csv( 'mean_entero.csv' )
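A note on this round-trip: with older pandas defaults, Series.to_csv writes no header row, so the subsequent read_csv consumes the first site as column names (which is why Poughkeepsie Drinking Water Intake disappears from the head a few cells down). A minimal sketch of the safer pattern, using an in-memory buffer and invented values:

```python
import io
import pandas as pd

# Toy Series mirroring mean_entero (site -> mean count), values invented
s = pd.Series({'Site A': 8.3, 'Site B': 15.5}, name='Average_Entero_Count')
s.index.name = 'Site'

# Writing with an explicit header keeps the first row as data
buf = io.StringIO()
s.to_csv(buf, header=True)
buf.seek(0)
back = pd.read_csv(buf)

print(back.shape)  # (2, 2): no row lost to the header
```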

In [22]:
mean_entero1 = pd.read_csv('mean_entero.csv')
mean_entero1.columns = ['Site','Average_Entero_Count']
mean_entero1.head() #make sure it looks right


Out[22]:
Site Average_Entero_Count
0 Croton Point Beach 15.458333
1 Stony Point mid-channel 17.340909
2 Little Stony Point 17.526316
3 Poughkeepsie Launch Ramp 17.675676
4 Haverstraw Bay mid-channel 18.708333

In [164]:
# Plot the results
%matplotlib inline
sns.set_style("ticks")
plt.figure(figsize=(10, 20))
ax = sns.barplot(x ="Average_Entero_Count", y="Site", data=mean_entero1, palette="Blues_d")
ax.set(xlabel='Average Entero Count', ylabel='Site')
ax.set_title('Waterways ranked by Average Entero Count')
sns.despine()

# Don't go swimming in the Gowanus Canal


The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.


In [68]:
# Are we looking for consistency or frequency in testing?
# Note to self - learn more about plotting time series data

# Determine the most recent reading for each site (might be useful?)
maxDate = raw_data.groupby(by=["Site"])["Date"].max()
maxDate.sort_values(ascending=False, inplace=True)

# what are our most FREQUENTLY tested Sites?
test_Site_counts = raw_data.groupby(by=["Site"])["Date"].count()
test_Site_counts.sort_values(ascending=False, inplace=True)

# Print out results
print("Most recently tested site dates")
print("++++++++++++++++++++++++")
print(maxDate.head(15))

print("\n")

print("Most frequently tested sites")
print("++++++++++++++++++++++++")
print(test_Site_counts.head(15))


Most recently tested site dates
++++++++++++++++++++++++
Site
Gay's Point mid-channel        2013-10-21
Athens                         2013-10-21
Hudson River above Troy Lock   2013-10-21
Hudson above Mohawk River      2013-10-21
Island Creek/Normans Kill      2013-10-21
Mohawk River at Waterford      2013-10-21
Dunn Memorial Bridge- Albany   2013-10-21
Castleton                      2013-10-21
Bethlehem Launch Ramp          2013-10-21
Coeymans Landing               2013-10-21
Hudson Landing Ramp            2013-10-21
Albany Rowing Dock             2013-10-21
Congress St. Bridge- Troy      2013-10-21
Coxsackie Waterfront Park      2013-10-21
Esopus Creek West              2013-10-20
Name: Date, dtype: datetime64[ns]


Most frequently tested sites
++++++++++++++++++++++++
Site
Piermont Pier                              186
Upper Sparkill Creek                       164
125th St. Pier                              65
Nyack Launch Ramp                           60
TZ Bridge mid-channel                       56
Newtown Creek- Dutch Kills                  56
Newtown Creek- Metropolitan Ave. Bridge     56
Orangetown STP Outfall                      56
Yonkers mid-channel                         51
Yonkers STP Outfall                         50
Saw Mill River                              49
East River mid-channel at 23rd St.          49
The Battery mid-channel                     48
GW Bridge mid-channel                       48
79th St. mid-channel                        48
Name: Date, dtype: int64

In [94]:
# Consistency is more important here.
# Figure out how many days elapsed between readings and calculate the average by site.
# This gives a general sense of the regularity of testing

# After lots of cursing and fiddling!
# Note: dates are not sorted within each site, so some diffs come out negative
# (handled with abs() below)
lag_date_data = (raw_data.groupby('Site')['Date']
                 .apply(lambda x: x.diff().mean())
                 .reset_index()
                 .rename(columns={'Date': 'Mean_lag_days'}))
lag_date_data.head()


Out[94]:
Site Mean_lag_days
0 125th St. Pier -8 days +21:45:00
1 79th St. mid-channel 40 days 06:07:39.574468
2 Albany Rowing Dock 54 days 15:31:45.882352
3 Annesville Creek 51 days 08:40:00
4 Athens 35 days 02:10:54.545454
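The negative lag for 125th St. Pier is the tell that the dates are unsorted within each site: .diff() on an out-of-order date column mixes positive and negative gaps and distorts the mean. A small sketch on toy dates showing why sorting first matters:

```python
import pandas as pd

# Toy readings for one site, deliberately out of chronological order
dates = pd.Series(pd.to_datetime(['2013-06-01', '2013-05-01', '2013-07-15']))

# Unsorted diffs mix signs: (-31 days + 75 days) / 2
unsorted_mean = dates.diff().mean()

# Sorting first gives the true average spacing: (31 days + 44 days) / 2
sorted_mean = dates.sort_values().diff().mean()

print(unsorted_mean, sorted_mean)
```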

In [96]:
# write it to a csv to make it more flexible
lag_date_data.to_csv( 'lag_data.csv' )

In [113]:
# Clean up the csv for plotting
lag_data = pd.read_csv('lag_data.csv')
lag_data.columns = ['index','Site','mean_test_lag']
lag_data_num_col = pd.DataFrame(lag_data['mean_test_lag'].str.split().tolist(),
                                columns=['mean_lag_days', 'junk', 'more_junk'])
mean_lag_data = pd.concat([lag_data, lag_data_num_col], axis=1)
del mean_lag_data['index']
del mean_lag_data['mean_test_lag']
del mean_lag_data['junk']
del mean_lag_data['more_junk']
# convert mean column to numeric type
mean_lag_data["mean_lag_days"] = pd.to_numeric(mean_lag_data["mean_lag_days"]) 
# apply absolute value to mean column
mean_lag_data['mean_lag_days'] = mean_lag_data['mean_lag_days'].abs()
mean_lag_data.head(10)


Out[113]:
Site mean_lag_days
0 125th St. Pier 8
1 79th St. mid-channel 40
2 Albany Rowing Dock 54
3 Annesville Creek 51
4 Athens 35
5 Beacon Harbor 21
6 Bethlehem Launch Ramp 34
7 Castle Point, NJ 18
8 Castleton 14
9 Catskill Creek- East End 38
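The CSV round-trip and string splitting above can also be skipped entirely: a timedelta column converts to whole days directly. A sketch of the same conversion on toy values (note the truncation gives 7 rather than 8 for the negative lag, since "-8 days +21:45:00" is really about -7.1 days):

```python
import pandas as pd

# Timedeltas like the Mean_lag_days column above; values constructed by hand
lags = pd.Series([pd.Timedelta(days=40, hours=6, minutes=7),
                  pd.Timedelta(days=-8, hours=21, minutes=45)])

# Take abs() first, then .dt.days truncates to whole days
days = lags.abs().dt.days
print(days.tolist())  # [40, 7]
```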

In [114]:
print(mean_lag_data.dtypes)


Site             object
mean_lag_days     int64
dtype: object

In [123]:
# Get a general sense of the spread - on average how often are the sites tested
%matplotlib inline
plt.figure(figsize=(4.5, 10))
sns.violinplot(y="mean_lag_days", data=mean_lag_data, palette="colorblind")
sns.despine()



In [162]:
# Subset a random sample of sites to compare and plot
# Random subset of ~10% of the sites, without replacement so no site repeats
sub_lag_data = mean_lag_data.sample(frac=0.1, replace=False)

%matplotlib inline
plt.figure(figsize=(10, 8))
ax = sns.barplot(x ="mean_lag_days", y="Site", data=sub_lag_data, palette="Blues_d")
ax.set(xlabel='Mean Lag Time', ylabel='Site')
ax.set_title('Mean Lag Time Between Tests in Days')
sns.despine()

# From the plot we can see there is a huge range in average lag between tests!


Is there a relationship between the amount of rain and water quality? Show this relationship graphically. If you can, estimate the effect of rain on quality at different sites and create a visualization to compare them.


In [203]:
# Cleanup so we're using just the Entero count and rain totals
# (this drops the columns from raw_data in place, hence the fresh read-in further down)
del raw_data['Site']
del raw_data['Date']
del raw_data['SampleCount']
raw_data.head()


Out[203]:
EnteroCount FourDayRainTotal
0 1733 1.5
1 4 0.2
2 20 0.0
3 6 0.0
4 31 0.0

In [253]:
%matplotlib inline
# Scatterplot to show the relationship between rainfall and Entero count
plt.figure(figsize=(7.5, 6.5))
sns.regplot(x='FourDayRainTotal', y='EnteroCount', data=raw_data, fit_reg=False, x_jitter=1)
sns.despine()
plt.ylim(0)
plt.xlim(0)

# There seems to be a relatively strong relationship between rainfall and water quality once
# you go beyond ~3 inches of rain; past that point there are far fewer high Entero count readings


Out[253]:
(0, 10.0)
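One way to put a number on the visual relationship is a rank correlation between rainfall and Entero count, which tolerates the heavily skewed counts better than a plain Pearson correlation would. A sketch on invented data (the column meanings match the dataset, the values do not):

```python
from scipy.stats import spearmanr

# Invented readings: four-day rain totals (inches) and Entero counts
rain  = [0.0, 0.2, 0.5, 1.2, 2.1, 3.0]
count = [4, 10, 25, 120, 400, 2400]

# Spearman's rho works on ranks, so the skew in the counts doesn't dominate
rho, p = spearmanr(rain, count)
print(rho)  # perfectly monotonic toy data, so rho is 1.0
```

On the real frame this would be `spearmanr(raw_data['FourDayRainTotal'], raw_data['EnteroCount'])`.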

In [227]:
# Compare any two locations with the following

# re-read in/clean up
data_url = "https://raw.githubusercontent.com/jlaurito/CUNY_IS608/master/lecture4/data/riverkeeper_data_2013.csv"
site_rain_data = pd.read_csv(data_url)
site_rain_data["Site"] = site_rain_data["Site"].astype('category') # like facets
site_rain_data["Date"] = pd.to_datetime(site_rain_data["Date"]) # dates as dates
site_rain_data["EnteroCount"] = site_rain_data["EnteroCount"].str.lstrip('><')
site_rain_data["EnteroCount"] = site_rain_data["EnteroCount"].astype('int64')
site_rain_data.head()


Out[227]:
Site Date EnteroCount FourDayRainTotal SampleCount
0 Hudson above Mohawk River 2011-10-16 1733 1.5 35
1 Hudson above Mohawk River 2013-10-21 4 0.2 35
2 Hudson above Mohawk River 2013-09-21 20 0.0 35
3 Hudson above Mohawk River 2013-08-19 6 0.0 35
4 Hudson above Mohawk River 2013-07-21 31 0.0 35

Enter some sites at the input prompts


In [283]:
site1 = input('enter your first site --> ') # Gowanus Canal <- worst


enter your first site --> Gowanus Canal

In [284]:
site2 = input('enter your second site --> ') # Croton Point Beach <- best water


enter your second site --> Croton Point Beach

In [285]:
# Gowanus is dirtiest, Croton Point Beach is cleanest
x = site_rain_data.loc[site_rain_data['Site'] == site1]
y = site_rain_data.loc[site_rain_data['Site'] == site2]
frames = [x, y]
result = pd.concat(frames)
del result['Date']
del result['SampleCount']
result.head()


Out[285]:
Site EnteroCount FourDayRainTotal
3360 Gowanus Canal 10 0
3361 Gowanus Canal 63 0
3362 Gowanus Canal 41 0
3363 Gowanus Canal 364 1
3364 Gowanus Canal 10 0

In [287]:
plt.figure(figsize=(10, 10))
sns.lmplot(x='FourDayRainTotal', y='EnteroCount', data=result, hue='Site', palette="Set2",
           fit_reg=False, scatter_kws={"s": 50}, legend=False)
plt.legend(loc='upper right')
sns.despine()
plt.ylim(0)
plt.xlim(0)


Out[287]:
(0, 10.0)
<matplotlib.figure.Figure at 0x117aa7c10>
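The lmplot above only scatters the two sites; the per-site "effect of rain" the prompt asks for can be sketched as a least-squares slope fitted per site. Toy data below (site names and values invented); on the real frame you would group site_rain_data by 'Site' and fit each group's FourDayRainTotal against EnteroCount:

```python
import numpy as np

# Invented (rain, count) pairs for two hypothetical sites
sites = {
    'Site A': ([0.0, 1.0, 2.0, 3.0], [10, 110, 210, 310]),
    'Site B': ([0.0, 1.0, 2.0, 3.0], [5, 15, 25, 35]),
}

# Degree-1 polyfit returns (slope, intercept); the slope is the
# estimated change in Entero count per additional inch of rain
slopes = {name: np.polyfit(rain, count, 1)[0]
          for name, (rain, count) in sites.items()}
print(slopes)  # Site A rises ~100 counts/inch, Site B ~10
```

A horizontal bar chart of these slopes, one bar per site, would make the cross-site comparison the prompt asks for.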