Import packages

Python has a variety of packages that contain functions for specific purposes. For example, "pandas" stands for Python Data Analysis Library and contains data analysis tools (http://pandas.pydata.org/). NumPy was designed for use in scientific computing and its functions are particularly useful for high-level computation (http://www.numpy.org/). Matplotlib is a plotting library that includes a wide range of useful plotting functions (http://matplotlib.org/). Packages have to be explicitly loaded to permit access to their functions.


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

Organize

First, let's import and organize the data that you downloaded from the CUAHSI HydroClient. You may need to update the file paths below if your directory structure is different than what is shown in the assignment document.

Import data

Upstream


In [2]:
## parse_dates tells python which column contains date/time data
## index_col assigns row index to specified column values
up_DO = pd.read_csv('Data/Clean/upstream/upstream-DO_mgL.csv',
                    parse_dates=['LocalTimestamp'],
                    index_col = 'LocalTimestamp')
up_level = pd.read_csv('Data/Clean/upstream/upstream-level_ft.csv',
                       parse_dates=['LocalTimestamp'],
                       index_col = 'LocalTimestamp')
up_wtemp = pd.read_csv('Data/Clean/upstream/upstream-wtemp_C.csv',
                       parse_dates=['LocalTimestamp'],
                       index_col = 'LocalTimestamp')

In [3]:
## rename columns
up_DO.columns = ['DO']
up_level.columns = ['Level']
up_wtemp.columns = ['WaterTemp']

Transitional


In [4]:
## upload transitional data
tr_DO = pd.read_csv('Data/Clean/transitional/transitional-DO_mgL.csv',
                    parse_dates=['LocalTimestamp'],
                    index_col = 'LocalTimestamp')
tr_level = pd.read_csv('Data/Clean/transitional/transitional-level_ft.csv',
                       parse_dates=['LocalTimestamp'],
                       index_col = 'LocalTimestamp')
tr_wtemp = pd.read_csv('Data/Clean/transitional/transitional-wtemp_C.csv',
                       parse_dates=['LocalTimestamp'],
                       index_col = 'LocalTimestamp')

In [5]:
## rename columns
tr_DO.columns = ['DO']
tr_level.columns = ['Level']
tr_wtemp.columns = ['WaterTemp']

Estuarine


In [6]:
## upload estuarine data
es_DO = pd.read_csv('Data/Clean/estuarine/estuarine-DO_mgL.csv',
                    parse_dates=['LocalTimestamp'],
                    index_col = 'LocalTimestamp')
es_level = pd.read_csv('Data/Clean/estuarine/estuarine-level_ft.csv',
                       parse_dates=['LocalTimestamp'],
                       index_col = 'LocalTimestamp')
es_wtemp = pd.read_csv('Data/Clean/estuarine/estuarine-wtemp_C.csv',
                       parse_dates=['LocalTimestamp'],
                       index_col = 'LocalTimestamp')

In [7]:
## rename columns
es_DO.columns = ['DO']
es_level.columns = ['Level']
es_wtemp.columns = ['WaterTemp']

The data are currently in individual arrays, but they should ideally be organized into datasets that correspond to each of the river reaches that we're evaluating.

Generate datasets for the river reaches


In [8]:
## let's look at the first three rows of the upstream DO data
up_DO[:3]


Out[8]:
DO
LocalTimestamp
2015-01-01 00:00:00 8.9
2015-01-01 01:00:00 8.9
2015-01-01 02:00:00 8.9

In [9]:
## now let's look at the upstream water level data
up_level[:3]


Out[9]:
Level
LocalTimestamp
2015-01-01 00:00:00 1.9
2015-01-01 00:15:00 1.9
2015-01-01 00:30:00 1.9

Note that these data have different temporal resolution - the DO data are hourly, but the water level data were captured every 15 minutes. We want to create a single data frame that contains our total dataset, but we need to make sure the data collection timestamps match.

Upstream dataset


In [10]:
## to do this, we can merge the data using the LocalTimestamp columns in each of our imported 
## data series:
upstream = up_level.join(up_DO)
upstream = upstream.join(up_wtemp)

In [11]:
## let's look at the first few rows of our newly merged data frame named "upstream"
upstream[:10]


Out[11]:
Level DO WaterTemp
LocalTimestamp
2015-01-01 00:00:00 1.90 8.9 17.6
2015-01-01 00:15:00 1.90 NaN NaN
2015-01-01 00:30:00 1.90 NaN NaN
2015-01-01 00:45:00 1.91 NaN NaN
2015-01-01 01:00:00 1.91 8.9 17.6
2015-01-01 01:15:00 1.92 NaN NaN
2015-01-01 01:30:00 1.92 NaN NaN
2015-01-01 01:45:00 1.93 NaN NaN
2015-01-01 02:00:00 1.93 8.9 17.6
2015-01-01 02:15:00 1.93 NaN NaN

Transitional dataset


In [12]:
transitional = tr_level.join(tr_DO)
transitional = transitional.join(tr_wtemp)
transitional[:5]


Out[12]:
Level DO WaterTemp
LocalTimestamp
2015-01-01 00:00:00 1.09 7.8 18.4
2015-01-01 00:15:00 1.10 7.8 18.4
2015-01-01 00:30:00 1.09 7.8 18.4
2015-01-01 00:45:00 1.09 7.7 18.4
2015-01-01 01:00:00 1.09 7.7 18.4

Note that all transitional data have a timestep of 15 minutes.

Estuarine dataset


In [13]:
estuarine = es_level.join(es_DO)
estuarine = estuarine.join(es_wtemp)
estuarine[:5]


Out[13]:
Level DO WaterTemp
LocalTimestamp
2015-01-01 00:00:00 -0.50 8.8 14.5
2015-01-01 00:15:00 -0.53 NaN NaN
2015-01-01 00:30:00 -0.54 NaN NaN
2015-01-01 00:45:00 -0.57 NaN NaN
2015-01-01 01:00:00 -0.58 8.6 14.3

LSJR dataset (ALL data: upstream, transitional, estuarine)


In [14]:
lsjr = upstream.join(transitional,rsuffix='_tr')
lsjr = lsjr.join(estuarine,rsuffix='_es')
lsjr.columns.values[0:3] = ['Level_up','DO_up','WaterTemp_up']
lsjr[:5]


Out[14]:
Level_up DO_up WaterTemp_up Level_tr DO_tr WaterTemp_tr Level_es DO_es WaterTemp_es
LocalTimestamp
2015-01-01 00:00:00 1.90 8.9 17.6 1.09 7.8 18.4 -0.50 8.8 14.5
2015-01-01 00:15:00 1.90 NaN NaN 1.10 7.8 18.4 -0.53 NaN NaN
2015-01-01 00:30:00 1.90 NaN NaN 1.09 7.8 18.4 -0.54 NaN NaN
2015-01-01 00:45:00 1.91 NaN NaN 1.09 7.7 18.4 -0.57 NaN NaN
2015-01-01 01:00:00 1.91 8.9 17.6 1.09 7.7 18.4 -0.58 8.6 14.3

Analyze

Let's start playing with the data! For this analysis, we're going to keep it simple. The code blocks below produce plots that allow for data to be visualized in a few different ways.

As you inspect these plots, think about the following: do you see any relationships between water temperature, water level, and DO? Are there seasonal differences in these relationships? How to patterns in the data change between the upstream, transitional, and estuarine sites?

DO


In [15]:
## use pd.concat to create a new dataframe composed of specific lsjr columns
df = pd.concat([lsjr['DO_up'],lsjr['DO_tr'],lsjr['DO_es']],axis=1)

## create a multi-line plot
df.plot(marker='.')

## plot each of the three DO time series independently
up_DO.plot(marker='.', title = 'Upstream')
tr_DO.plot(marker='.', title = 'Transitional')
es_DO.plot(marker='.', title = 'Estuarine')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1125bfd10>

In [16]:
## increase plot size for subsequent figures
plt.rcParams['figure.figsize'] = (16,12)

We can also look at seasonal variability by looking at specific months. The code below creates data subsets for January and July. Do you think these are the best months to evaluate to detect seasonal variation in the data? To look at other months, simply edit the code.


In [17]:
## create July subset and plot
dfjul = df[(df.index.month == 7)]
dfjul.dropna().plot(marker='.',title='July') # here, drop.na() has been used to get rid of 
                                             # rows that contain NA values, since these will
                                             # interfere with the plot.

## create January subset and plot
dfjan = df[(df.index.month == 1)]
dfjan.dropna().plot(marker='.',title='January')


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x112cb3850>

In [18]:
## July and January boxplots
dfjul.plot.box()
dfjan.plot.box()


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x111dba190>

Water level


In [19]:
## use pd.concat to create a new dataframe composed of specific lsjr columns
df_lv = pd.concat([lsjr['Level_up'],lsjr['Level_tr'],lsjr['Level_es']],axis=1)

## create July subset and plot
df_lv_jul = df_lv[(df.index.month == 7)]
df_lv_jul.dropna().plot(marker='.',title='July water level')

## create January subset and plot
df_lv_jan = df_lv[(df.index.month == 1)]
df_lv_jan.dropna().plot(marker='.',title='January water level')


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1165d84d0>

Water temperature


In [20]:
## use pd.concat to create a new dataframe composed of specific lsjr columns
df_wt = pd.concat([lsjr['WaterTemp_up'],lsjr['WaterTemp_tr'],lsjr['WaterTemp_es']],axis=1)

## create July subset and plot
df_wt_jul = df_wt[(df_wt.index.month == 7)]
df_wt_jul.dropna().plot(marker='.',title='July water temperature')

## create January subset and plot
df_wt_jan = df_wt[(df_wt.index.month == 1)]
df_wt_jan.dropna().plot(marker='.',title='January water temperature')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x112eefc50>

Upstream


In [21]:
## plot entire upstream dataset
upstream.dropna().plot(marker='.',title='Upstream')

## create July subset and plot
upstream_jul = upstream[(upstream.index.month == 7)]
upstream_jul.dropna().plot(marker='.',title='Upstream - July')

## create January subset and plot
upstream_jan = upstream[(upstream.index.month == 1)]
upstream_jan.dropna().plot(marker='.',title='Upstream - January')


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1165b8310>

Transitional


In [22]:
## plot entire transitional dataset
transitional.dropna().plot(marker='.',title='Transitional')

## create July subset and plot
transitional_jul = transitional[(transitional.index.month == 7)]
transitional_jul.dropna().plot(marker='.',title='Transitional - July')

## create January subset and plot
transitional_jan = transitional[(transitional.index.month == 1)]
transitional_jan.dropna().plot(marker='.',title='Transitional - January')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x116fee150>

Estuarine


In [23]:
## plot entire estuarine dataset
estuarine.dropna().plot(marker='.',title='Estuarine')

## create July subset and plot
estuarine_jul = estuarine[(estuarine.index.month == 7)]
estuarine_jul.dropna().plot(marker='.',title='Estuarine - July')

## create January subset and plot
estuarine_jan = estuarine[(estuarine.index.month == 1)]
estuarine_jan.dropna().plot(marker='.',title='Estuarine - January')


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x11298eb90>

Quantify

Now that you've visually inspected the time series and boxplots, calculate correlation and cross-correlation strength of the relationships that you were able to visually detect. But, remember, correlation does NOT = causation!

Correlations by river reach

Note: edit the cells below to look at specific months, i.e. upstream_jan.corr(), upstream_jul.corr()


In [24]:
upstream.corr()


Out[24]:
Level DO WaterTemp
Level 1.000000 0.511655 -0.432524
DO 0.511655 1.000000 -0.761935
WaterTemp -0.432524 -0.761935 1.000000

In [25]:
transitional.corr()


Out[25]:
Level DO WaterTemp
Level 1.000000 0.017245 -0.210073
DO 0.017245 1.000000 -0.641332
WaterTemp -0.210073 -0.641332 1.000000

In [26]:
estuarine.corr()


Out[26]:
Level DO WaterTemp
Level 1.000000 -0.086737 -0.011462
DO -0.086737 1.000000 -0.955116
WaterTemp -0.011462 -0.955116 1.000000

Cross-correlations

Upstream


In [27]:
## lag of 1 = 1 hour for upstream and estuarine, 15 minutes for transitional 
## (remember the temporal resolution of the different datasets)
lag = 1

In [28]:
upstream_jul['DO'].dropna().corr(upstream_jul['WaterTemp'].dropna().shift(lag))


Out[28]:
0.65077418240504925

In [29]:
upstream_jul['DO'].dropna().corr(upstream_jul['Level'].dropna().shift(lag))


Out[29]:
0.45092953163348298

In [30]:
upstream_jan['DO'].dropna().corr(upstream_jan['WaterTemp'].dropna().shift(lag))


Out[30]:
-0.92657696777632725

In [31]:
upstream_jan['DO'].dropna().corr(upstream_jan['Level'].dropna().shift(lag))


Out[31]:
-0.036902491382764128

Transitional


In [32]:
lag = 4

In [33]:
transitional_jul['DO'].dropna().corr(transitional_jul['WaterTemp'].dropna().shift(lag))


Out[33]:
0.56492389713058155

In [34]:
transitional_jul['DO'].dropna().corr(transitional_jul['Level'].dropna().shift(lag))


Out[34]:
-0.36938618222952407

In [35]:
transitional_jan['DO'].dropna().corr(transitional_jan['WaterTemp'].dropna().shift(lag))


Out[35]:
-0.71206739233918825

In [36]:
transitional_jan['DO'].dropna().corr(transitional_jan['Level'].dropna().shift(lag))


Out[36]:
-0.15415676138880124

Estuarine


In [37]:
lag = 1

In [38]:
estuarine_jul['DO'].dropna().corr(estuarine_jul['WaterTemp'].dropna().shift(lag))


Out[38]:
-0.60193670671041755

In [39]:
estuarine_jul['DO'].dropna().corr(estuarine_jul['Level'].dropna().shift(lag))


Out[39]:
0.30885212253042515

In [40]:
estuarine_jan['DO'].dropna().corr(estuarine_jan['WaterTemp'].dropna().shift(lag))


Out[40]:
-0.88784959753561787

In [41]:
estuarine_jan['DO'].dropna().corr(estuarine_jan['Level'].dropna().shift(lag))


Out[41]:
-0.20445073547525111

Discuss

Return to your assignment and answer the discussion questions.