1D analysis: pandas!


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
pd.set_option('max_rows', 6)  # max number of rows to show in this notebook — to save space!

import seaborn as sns  # for better style in plots

Reading in data to a dataframe

For 1D analysis, we are generally thinking about data that varies in time, that is, time series analysis. The pandas package is particularly well suited to this type of data, having very convenient methods for interpreting, searching through, and working with time representations.

Let's start with the example we started the class with: taxi rides in New York City.


In [2]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', parse_dates=[0, 2], index_col=[0])

What do all these (and other) input keyword arguments do?

  • header: tells pandas which row of the data file is the header, from which it extracts column names
  • parse_dates: try to interpret the values in column [col] (or combined columns [[col1, col2]]) as dates, converting them into datetime objects.
  • index_col: if no index column is given, rows are indexed by a counter starting from 0. Passing index_col=[column integer] uses that column as the index instead. This is usually done with the time information for the dataset.
  • skiprows: skip specific rows (skiprows=[list of row numbers, counted from 0 at the start of the file]) or a number of rows from the top (skiprows=N).
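These keyword arguments can be sketched on a small made-up CSV (the column names and values here are invented for illustration, not taken from the taxi file):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a data file
csv_text = """time,trip_distance,fare
2016-05-01 00:00:00,3.6,15.0
2016-05-01 00:30:00,1.6,8.0
2016-05-01 01:00:00,2.3,9.5
"""

toy = pd.read_csv(
    io.StringIO(csv_text),
    header=0,          # first line holds the column names
    parse_dates=[0],   # interpret column 0 as datetimes
    index_col=0,       # use that column as the row index
)
```

After this call, `toy.index` is a DatetimeIndex and the remaining columns are `trip_distance` and `fare`.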

We can check to make sure the date/time information has been read in as the index, which allows us to reference the other columns using this time information really easily:


In [3]:
df.index


Out[3]:
DatetimeIndex(['2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               ...
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00'],
              dtype='datetime64[ns]', name='tpep_pickup_datetime', length=38211, freq=None)

From this we see that the index is indeed using the timing information in the file, and we can see that the dtype is datetime.

Selecting rows and columns of data

In particular, we will select rows based on the index. Since in this example we are indexing by time, we can use human-readable notation to select based on date/times themselves instead of index. Columns can be selected by name.

We can now access the columns of the file using dictionary-style keys, like so:


In [4]:
df['trip_distance']


Out[4]:
tpep_pickup_datetime
2016-05-01 00:00:00     3.60
2016-05-01 00:00:00    18.41
2016-05-01 00:00:00     1.60
                       ...  
2016-05-01 23:59:00    19.80
2016-05-01 23:59:00     0.50
2016-05-01 23:59:00     4.05
Name: trip_distance, Length: 38211, dtype: float64

We can equivalently access the columns of data as attributes. This means that we can use tab autocompletion to see the methods and data available in a dataframe.


In [5]:
df.trip_distance


Out[5]:
tpep_pickup_datetime
2016-05-01 00:00:00     3.60
2016-05-01 00:00:00    18.41
2016-05-01 00:00:00     1.60
                       ...  
2016-05-01 23:59:00    19.80
2016-05-01 23:59:00     0.50
2016-05-01 23:59:00     4.05
Name: trip_distance, Length: 38211, dtype: float64

We can plot in this way, too:


In [6]:
df['trip_distance'].plot(figsize=(14,6))


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2856751f98>

Simple data selection

One of the biggest benefits of using pandas is being able to easily reference the data in intuitive ways. For example, because we set up the index of the dataframe to be the date and time, we can pull out data using dates. In the following, we pull out all data from the first hour of the day:


In [7]:
df['2016-05-01 00']


Out[7]:
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2016-05-01 00:00:00 1 2016-05-01 00:17:00 1 3.60 -73.985901 40.768040 1 N -73.983986 40.730099 1 15.0 0.5 0.5 1.50 0.0 0.3 17.80
2016-05-01 00:00:00 2 2016-05-01 00:00:00 2 18.41 -73.781059 40.645020 2 N -73.984573 40.768398 2 52.0 0.0 0.5 0.00 0.0 0.3 52.80
2016-05-01 00:00:00 2 2016-05-01 00:08:00 1 1.60 -74.002769 40.728619 1 N -73.984161 40.737438 1 8.0 0.5 0.5 1.86 0.0 0.3 11.16
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-05-01 00:59:00 1 2016-05-01 01:09:00 2 2.30 -73.999451 40.728214 1 N -73.997505 40.745892 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
2016-05-01 00:59:00 2 2016-05-01 01:14:00 5 3.58 -74.001373 40.761669 1 N -73.967537 40.791241 1 14.5 0.5 0.5 3.16 0.0 0.3 18.96
2016-05-01 00:59:00 2 2016-05-01 01:08:00 1 1.70 -73.986198 40.752426 1 N -74.003548 40.740318 1 8.0 0.5 0.5 2.79 0.0 0.3 12.09

2447 rows × 18 columns

Here we further subdivide to examine the passenger count during that time period:


In [8]:
df['passenger_count']['2016-05-01 00']


Out[8]:
tpep_pickup_datetime
2016-05-01 00:00:00    1
2016-05-01 00:00:00    2
2016-05-01 00:00:00    1
                      ..
2016-05-01 00:59:00    2
2016-05-01 00:59:00    5
2016-05-01 00:59:00    1
Name: passenger_count, Length: 2447, dtype: int64

We can also access a range of data, for example any data rows from midnight until noon:


In [9]:
df['2016-05-01 00':'2016-05-01 11']


Out[9]:
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2016-05-01 00:00:00 1 2016-05-01 00:17:00 1 3.60 -73.985901 40.768040 1 N -73.983986 40.730099 1 15.0 0.5 0.5 1.50 0.00 0.3 17.80
2016-05-01 00:00:00 2 2016-05-01 00:00:00 2 18.41 -73.781059 40.645020 2 N -73.984573 40.768398 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
2016-05-01 00:00:00 2 2016-05-01 00:08:00 1 1.60 -74.002769 40.728619 1 N -73.984161 40.737438 1 8.0 0.5 0.5 1.86 0.00 0.3 11.16
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-05-01 11:59:00 2 2016-05-01 12:01:00 5 0.03 -73.997009 40.766052 1 N -73.996841 40.766403 2 3.0 0.0 0.5 0.00 0.00 0.3 3.80
2016-05-01 11:59:00 1 2016-05-01 12:15:00 1 1.90 -74.006813 40.730862 1 N -73.987183 40.714912 1 11.0 0.0 0.5 2.35 0.00 0.3 14.15
2016-05-01 11:59:00 2 2016-05-01 12:43:00 2 13.63 -73.977943 40.762718 1 N -73.965538 40.626259 1 43.5 0.0 0.5 5.00 5.54 0.3 54.84

15951 rows × 18 columns
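The same kind of time-based selection can be sketched on a small synthetic series (the values below are arbitrary counters, not taxi data); `.loc` is used here since it works the same way for both single periods and ranges:

```python
import pandas as pd

# A made-up minutely series covering one day
idx = pd.date_range('2016-05-01 00:00', '2016-05-01 23:59', freq='1min')
s = pd.Series(range(len(idx)), index=idx)

first_hour = s.loc['2016-05-01 00']               # everything in hour 00
morning = s.loc['2016-05-01 00':'2016-05-01 11']  # midnight through 11:59
```

The partial date string selects the whole hour (60 rows), and the slice runs through the end of hour 11 (720 rows).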

If you want more choice in your selection

The following, adding on minutes, does not work:


In [10]:
df['2016-05-01 00:30']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2889             try:
-> 2890                 return self._engine.get_loc(key)
   2891             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '2016-05-01 00:30'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-fb56de796e39> in <module>
----> 1 df['2016-05-01 00:30']

/opt/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2973             if self.columns.nlevels > 1:
   2974                 return self._getitem_multilevel(key)
-> 2975             indexer = self.columns.get_loc(key)
   2976             if is_integer(indexer):
   2977                 indexer = [indexer]

/opt/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2890                 return self._engine.get_loc(key)
   2891             except KeyError:
-> 2892                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2893         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2894         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '2016-05-01 00:30'

However, we can use another approach for more control: .loc accesses combinations of specific rows and/or columns, or subsets of rows and/or columns.


In [11]:
df.loc['2016-05-01 00:30']


Out[11]:
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2016-05-01 00:30:00 1 2016-05-01 00:41:00 1 3.10 -73.984566 40.752789 1 N -73.954048 40.780731 1 11.5 0.5 0.5 2.55 0.0 0.3 15.35
2016-05-01 00:30:00 2 2016-05-01 00:37:00 1 1.34 -73.987144 40.760605 1 N -73.990891 40.750931 1 7.0 0.5 0.5 1.66 0.0 0.3 9.96
2016-05-01 00:30:00 1 2016-05-01 00:35:00 3 0.80 -73.994965 40.727703 1 N -74.000732 40.732288 1 5.5 0.5 0.5 1.70 0.0 0.3 8.50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-05-01 00:30:00 1 2016-05-01 00:35:00 1 0.70 -74.000664 40.737263 1 N -74.000732 40.728989 1 5.0 0.5 0.5 1.25 0.0 0.3 7.55
2016-05-01 00:30:00 2 2016-05-01 00:44:00 2 5.51 -73.993782 40.712341 1 N -73.958893 40.768017 1 18.0 0.5 0.5 2.00 0.0 0.3 21.30
2016-05-01 00:30:00 2 2016-05-01 00:40:00 2 1.05 -73.989548 40.747219 1 N -73.987579 40.738537 1 7.0 0.5 0.5 1.00 0.0 0.3 9.30

41 rows × 18 columns

You can also select specific columns along with a time period, using the pattern:

df.loc[row_label, col_label]


In [12]:
df.loc['2016-05-01 00:30', 'passenger_count']


Out[12]:
tpep_pickup_datetime
2016-05-01 00:30:00    1
2016-05-01 00:30:00    1
2016-05-01 00:30:00    3
                      ..
2016-05-01 00:30:00    1
2016-05-01 00:30:00    2
2016-05-01 00:30:00    2
Name: passenger_count, Length: 41, dtype: int64

You can select more than one column:


In [13]:
df.loc['2016-05-01 00:30', ['passenger_count','trip_distance']]


Out[13]:
passenger_count trip_distance
tpep_pickup_datetime
2016-05-01 00:30:00 1 3.10
2016-05-01 00:30:00 1 1.34
2016-05-01 00:30:00 3 0.80
... ... ...
2016-05-01 00:30:00 1 0.70
2016-05-01 00:30:00 2 5.51
2016-05-01 00:30:00 2 1.05

41 rows × 2 columns

You can select a range of data:


In [14]:
df.loc['2016-05-01 00:30':'2016-05-01 01:30', ['passenger_count','trip_distance']]


Out[14]:
passenger_count trip_distance
tpep_pickup_datetime
2016-05-01 00:30:00 1 3.10
2016-05-01 00:30:00 1 1.34
2016-05-01 00:30:00 3 0.80
... ... ...
2016-05-01 01:30:00 2 0.70
2016-05-01 01:30:00 1 1.46
2016-05-01 01:30:00 1 1.64

2311 rows × 2 columns

You can alternatively select data by integer position instead of by label, using iloc instead of loc. Here we select the first 5 rows of data for all columns:


In [15]:
df.iloc[0:5, :]


Out[15]:
VendorID tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
tpep_pickup_datetime
2016-05-01 1 2016-05-01 00:17:00 1 3.60 -73.985901 40.768040 1 N -73.983986 40.730099 1 15.0 0.5 0.5 1.50 0.0 0.3 17.80
2016-05-01 2 2016-05-01 00:00:00 2 18.41 -73.781059 40.645020 2 N -73.984573 40.768398 2 52.0 0.0 0.5 0.00 0.0 0.3 52.80
2016-05-01 2 2016-05-01 00:08:00 1 1.60 -74.002769 40.728619 1 N -73.984161 40.737438 1 8.0 0.5 0.5 1.86 0.0 0.3 11.16
2016-05-01 1 2016-05-01 00:19:00 1 1.90 -73.987854 40.749043 1 N -73.992203 40.725452 1 13.0 0.5 0.5 2.85 0.0 0.3 17.15
2016-05-01 2 2016-05-01 00:10:00 1 2.37 -74.006760 40.716454 1 N -73.989731 40.746796 1 10.0 0.5 0.5 2.26 0.0 0.3 13.56
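The contrast between label-based .loc and position-based .iloc can be sketched on a toy dataframe (column names mimic the taxi data; the values are invented):

```python
import pandas as pd

idx = pd.date_range('2016-05-01', periods=4, freq='60min')
toy = pd.DataFrame({'passenger_count': [1, 2, 1, 3],
                    'trip_distance': [3.6, 1.6, 2.3, 0.8]}, index=idx)

# .loc: select by row label and column name
by_label = toy.loc['2016-05-01 01:00', 'trip_distance']

# .iloc: select by integer position, here the first two rows
first_two = toy.iloc[0:2, :]
```

The .loc call pulls out one cell (1.6), while the .iloc call returns a two-row dataframe regardless of what the labels are.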

Exercise

Access the data from dataframe df for the last three hours of the day at once. Plot the tip amount (tip_amount) for this time period.

After you can make a line plot, try making a histogram of the data. Play around with the data range and the number of bins. A number of plot types are built into the pandas plot method via the kind keyword argument.



In [ ]:


Exercise

Using pandas, read in the CTD data we've used in class several times. What variable would make sense to use for your index column?



In [ ]:

Notes about datetimes

You can change the format of datetimes using strftime().

Compare the datetimes in our dataframe index in the first cell below with the second cell, in which we format the look of the datetimes differently. We choose how they look using formatting codes; you can find a comprehensive list of the formatting directives at http://strftime.org/. Note that inside the format string, you can write other characters that will be passed through as-is (like the comma in the example below).
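A quick sketch of the formatting codes on a single timestamp; note the difference between %H (hour), %M (minute), and %m (month), which are easy to confuse:

```python
import pandas as pd

t = pd.Timestamp('2016-05-01 00:30:00')

# month name, day, year, then hour:minute
formatted = t.strftime('%b %d, %Y %H:%M')
```

Here `formatted` comes out as 'May 01, 2016 00:30'; swapping %M for %m would silently print the month (05) where the minute belongs.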


In [16]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', parse_dates=[0, 2], index_col=[0])
df.index


Out[16]:
DatetimeIndex(['2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               '2016-05-01 00:00:00', '2016-05-01 00:00:00',
               ...
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00',
               '2016-05-01 23:59:00', '2016-05-01 23:59:00'],
              dtype='datetime64[ns]', name='tpep_pickup_datetime', length=38211, freq=None)

In [17]:
df.index.strftime('%b %d, %Y %H:%M')


Out[17]:
Index(['May 01, 2016 00:00', 'May 01, 2016 00:00', 'May 01, 2016 00:00',
       'May 01, 2016 00:00', 'May 01, 2016 00:00', 'May 01, 2016 00:00',
       'May 01, 2016 00:00', 'May 01, 2016 00:00', 'May 01, 2016 00:00',
       'May 01, 2016 00:00',
       ...
       'May 01, 2016 23:59', 'May 01, 2016 23:59', 'May 01, 2016 23:59',
       'May 01, 2016 23:59', 'May 01, 2016 23:59', 'May 01, 2016 23:59',
       'May 01, 2016 23:59', 'May 01, 2016 23:59', 'May 01, 2016 23:59',
       'May 01, 2016 23:59'],
      dtype='object', length=38211)

You can create and use datetimes using pandas, which will interpret the information you put into a string as best it can. Year-month-day is a good, unambiguous ordering for dates, avoiding confusion between American and European conventions.

After defining a pandas Timestamp, you can also change time using Timedelta.


In [18]:
now = pd.Timestamp('October 22, 2019 1:19PM')
now


Out[18]:
Timestamp('2019-10-22 13:19:00')

In [19]:
tomorrow = pd.Timedelta('1 day')
now + tomorrow


Out[19]:
Timestamp('2019-10-23 13:19:00')
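Timedelta accepts many unit strings beyond whole days; a small sketch with an arbitrary starting time:

```python
import pandas as pd

t0 = pd.Timestamp('2019-10-22 13:19')

# Timedelta strings can mix units like minutes, hours, or days
later = t0 + pd.Timedelta('90 minutes')
```

Adding 90 minutes moves the timestamp to 14:49 the same day; subtracting Timedeltas works the same way.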

You can set up a range of datetimes to make your own dataframe indices with date_range. Codes for the frequency (like '15T' for 15 minutes) are listed later in this notebook.


In [20]:
pd.date_range(start='Jan 1 2019', end='May 1 2019', freq='15T')


Out[20]:
DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:15:00',
               '2019-01-01 00:30:00', '2019-01-01 00:45:00',
               '2019-01-01 01:00:00', '2019-01-01 01:15:00',
               '2019-01-01 01:30:00', '2019-01-01 01:45:00',
               '2019-01-01 02:00:00', '2019-01-01 02:15:00',
               ...
               '2019-04-30 21:45:00', '2019-04-30 22:00:00',
               '2019-04-30 22:15:00', '2019-04-30 22:30:00',
               '2019-04-30 22:45:00', '2019-04-30 23:00:00',
               '2019-04-30 23:15:00', '2019-04-30 23:30:00',
               '2019-04-30 23:45:00', '2019-05-01 00:00:00'],
              dtype='datetime64[ns]', length=11521, freq='15T')

Note that you can get many different measures of your time index.


In [21]:
df.index.minute


Out[21]:
Int64Index([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            ...
            59, 59, 59, 59, 59, 59, 59, 59, 59, 59],
           dtype='int64', name='tpep_pickup_datetime', length=38211)

In [22]:
df.index.dayofweek


Out[22]:
Int64Index([6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            ...
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64', name='tpep_pickup_datetime', length=38211)

Exercise

How would you change the call to strftime above to format all of the indices such that the first index, for example, would be "the 1st of May, 2016 at the hour of 00 and the minute of 00 and the seconds of 00, which is the following day of the week: Sunday." Use the format codes for as many of the values as possible.



In [ ]:

Adding column to dataframe

We can add data to our dataframe very easily. Below we compute a new quantity from an existing column and save it as a new column.


In [23]:
df['tip squared'] = df.tip_amount**2  # making up some numbers to save to a new column
df['tip squared'].plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2853f52b00>
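The column-assignment pattern can be sketched on a toy dataframe (the column name mimics the taxi data; the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({'tip_amount': [1.5, 0.0, 2.0]})

# Assigning to a new key adds a column, computed elementwise
toy['tip squared'] = toy.tip_amount**2
```

The new column appears alongside the existing one and can be selected, plotted, or resampled just like any other column.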

Another example: Wind data

Let's read in the wind data file that we have used before to have another data set to use. Note the parameters used to read it in properly.


In [24]:
df2 = pd.read_table('../data/burl1h2010.txt', header=0, skiprows=[1], delim_whitespace=True, 
                    parse_dates={'dates': ['#YY', 'MM', 'DD', 'hh']},  index_col=0)
df2


Out[24]:
mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE
dates
2010-01-01 00:00:00 0 134 3.0 3.0 99.0 99.0 99.0 999 1019.1 13.3 999.0 12.4 99.0 99.0
2010-01-01 01:00:00 0 123 3.6 3.6 99.0 99.0 99.0 999 1019.1 13.6 999.0 12.7 99.0 99.0
2010-01-01 02:00:00 0 130 3.3 3.3 99.0 99.0 99.0 999 1019.7 13.6 999.0 12.9 99.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2010-12-31 21:00:00 0 147 11.4 11.8 99.0 99.0 99.0 999 1012.4 20.4 999.0 19.2 99.0 99.0
2010-12-31 22:00:00 0 152 11.9 12.9 99.0 99.0 99.0 999 1012.6 20.6 999.0 19.6 99.0 99.0
2010-12-31 23:00:00 0 150 10.2 10.9 99.0 99.0 99.0 999 1012.8 20.4 999.0 19.6 99.0 99.0

8755 rows × 14 columns


In [25]:
df2.index


Out[25]:
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:00:00',
               '2010-01-01 02:00:00', '2010-01-01 03:00:00',
               '2010-01-01 04:00:00', '2010-01-01 05:00:00',
               '2010-01-01 06:00:00', '2010-01-01 07:00:00',
               '2010-01-01 08:00:00', '2010-01-01 09:00:00',
               ...
               '2010-12-31 14:00:00', '2010-12-31 15:00:00',
               '2010-12-31 16:00:00', '2010-12-31 17:00:00',
               '2010-12-31 18:00:00', '2010-12-31 19:00:00',
               '2010-12-31 20:00:00', '2010-12-31 21:00:00',
               '2010-12-31 22:00:00', '2010-12-31 23:00:00'],
              dtype='datetime64[ns]', name='dates', length=8755, freq=None)

Plotting with pandas

You can plot with matplotlib and control many things directly from pandas. Get more info about plotting from pandas dataframes directly from:


In [26]:
df.plot?

You can mix and match pandas and matplotlib plotting: set up a figure and axes with matplotlib and pass the axes into your dataframe's plot call, or start with a pandas plot and save the axes that call returns. Each will be demonstrated next. Or, you can bring the pandas data to matplotlib fully.

Start from matplotlib, then input axes to pandas

To demonstrate plotting starting from matplotlib, we also show two ways to choose which data columns to plot: select the columns in the line before the plot call, or choose them within the plot call itself.

The key part here is that you pass your pandas plot call the axes you want plotted into (here: ax=axes[0]).


In [27]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14,4))

df2['WSPD']['2010-5'].plot(ax=axes[0])
df2.loc['2010-5'].plot(y='WSPD', ax=axes[1])


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28533af550>

Start with pandas, then use matplotlib commands

The important part here is that the call to pandas dataframe plotting returns an axes handle which you can save; here, it is saved as "ax".


In [28]:
ax = df2['WSPD']['2010 11 1'].plot()
ax.set_ylabel('Wind speed')


Out[28]:
Text(0, 0.5, 'Wind speed')

Bring pandas dataframe data to matplotlib fully

You can also use matplotlib directly by pulling the data you want to plot out of your dataframe.


In [29]:
plt.plot(df2['WSPD'])


Out[29]:
[<matplotlib.lines.Line2D at 0x7f2851a4a4a8>]

Plot all or multiple columns at once


In [30]:
# all
df2.plot()


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28519b30f0>

To plot more than one but less than all columns, give a list of column names. Here are two ways to do the same thing:


In [31]:
# multiple
fig, axes = plt.subplots(1, 2, figsize=(14,4))
df2[['WSPD', 'GST']].plot(ax=axes[0])
df2.plot(y=['WSPD', 'GST'], ax=axes[1])


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2851886c50>

Formatting dates

You can control how datetimes look on the x axis in these plots as demonstrated in this section. The formatting codes used in the call to DateFormatter are the same as those used above in this notebook for strftime.

Note that you can control minor ticks in the same way, too.


In [32]:
ax = df2['WSPD'].plot(figsize=(14,4))



In [33]:
from matplotlib.dates import DateFormatter

ax = df2['WSPD'].plot(figsize=(14,4))
ax.set_xlabel('2010')

date_form = DateFormatter("%b %d")
ax.xaxis.set_major_formatter(date_form)

# import matplotlib.dates as mdates
# # You can also control where the ticks are located, by date with Locators
# ticklocations = mdates.MonthLocator()
# ax.xaxis.set_major_locator(ticklocations)


Plotting with twin axis

You can very easily plot two variables with different y axis limits with the secondary_y keyword argument to df.plot.


In [34]:
axleft = df2['WSPD']['2010-10'].plot(figsize=(14,4))
axright = df2['WDIR']['2010-10'].plot(secondary_y=True, alpha=0.5)
axleft.set_ylabel('Speed [m/s]', color='blue');
axright.set_ylabel('Dir [degrees]', color='orange');


Resampling

Sometimes we want our data at a different sampling frequency than we have; that is, we want to change the time between rows or observations. Changing this is called resampling. We can upsample to increase the number of data points in a given dataset (decrease the period between points) or downsample to decrease the number of data points.

The wind data is given every hour. Here we downsample it to once a day instead. After the resample call, a method is needed that specifies how to combine the existing data over each downsampling period. We could use the max value over each 1-day period to represent that day:
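A toy downsampling example, using an invented counter in place of the wind values, shows the mechanics:

```python
import pandas as pd

# Two days of made-up hourly data
idx = pd.date_range('2010-01-01', periods=48, freq='60min')
s = pd.Series(range(48), index=idx)

# Reduce to one value per day, keeping the max in each day
daily_max = s.resample('1D').max()
```

The 48 hourly values collapse to 2 daily values, each the largest observation in its day.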


In [35]:
df2.resample('1d').max()  # now the data is daily


Out[35]:
mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE
dates
2010-01-01 0 357 14.0 15.6 99.0 99.0 99.0 999 1025.5 13.6 999.0 12.9 99.0 99.0
2010-01-02 0 33 10.6 11.3 99.0 99.0 99.0 999 1028.8 9.9 999.0 4.5 99.0 99.0
2010-01-03 0 54 10.1 10.9 99.0 99.0 99.0 999 1025.7 8.4 999.0 3.3 99.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2010-12-29 0 170 11.4 12.1 99.0 99.0 99.0 999 1024.5 18.1 999.0 12.8 99.0 99.0
2010-12-30 0 322 14.9 16.1 99.0 99.0 99.0 999 1014.8 20.3 999.0 18.1 99.0 99.0
2010-12-31 0 152 12.0 13.8 99.0 99.0 99.0 999 1015.6 20.7 999.0 19.6 99.0 99.0

365 rows × 14 columns

It's always important to check our results to make sure they look reasonable. Let's plot our resampled data with the original data to make sure they align well. We'll choose one variable for this check.

We can see that the daily max wind gust does indeed look like the max value for each day, though note that it is plotted at the start of the day.


In [36]:
df2['GST']['2010-4-1':'2010-4-5'].plot()
df2.resample('1d').max()['GST']['2010-4-1':'2010-4-5'].plot()


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f285141b080>

We can also upsample our data, that is, add more rows of data. Note that, as before, after we resample we still need a method on the end telling pandas how to process the data. However, since in this case we are not combining data (downsampling) but adding rows (upsampling), a function like max doesn't change the existing observations (it takes the max of a single row at a time). For the new rows, we haven't said how to fill them, so they are NaNs by default.

Here we are changing from having data every hour to having it every 30 minutes.


In [37]:
df2.resample('30min').max()  # max doesn't say what to do with data in new rows


Out[37]:
mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE
dates
2010-01-01 00:00:00 0.0 134.0 3.0 3.0 99.0 99.0 99.0 999.0 1019.1 13.3 999.0 12.4 99.0 99.0
2010-01-01 00:30:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-01-01 01:00:00 0.0 123.0 3.6 3.6 99.0 99.0 99.0 999.0 1019.1 13.6 999.0 12.7 99.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2010-12-31 22:00:00 0.0 152.0 11.9 12.9 99.0 99.0 99.0 999.0 1012.6 20.6 999.0 19.6 99.0 99.0
2010-12-31 22:30:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-12-31 23:00:00 0.0 150.0 10.2 10.9 99.0 99.0 99.0 999.0 1012.8 20.4 999.0 19.6 99.0 99.0

17519 rows × 14 columns

When upsampling, a reasonable option is to fill the new rows with data from the previous existing row:


In [38]:
df2.resample('30min').ffill()


Out[38]:
mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE
dates
2010-01-01 00:00:00 0 134 3.0 3.0 99.0 99.0 99.0 999 1019.1 13.3 999.0 12.4 99.0 99.0
2010-01-01 00:30:00 0 134 3.0 3.0 99.0 99.0 99.0 999 1019.1 13.3 999.0 12.4 99.0 99.0
2010-01-01 01:00:00 0 123 3.6 3.6 99.0 99.0 99.0 999 1019.1 13.6 999.0 12.7 99.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2010-12-31 22:00:00 0 152 11.9 12.9 99.0 99.0 99.0 999 1012.6 20.6 999.0 19.6 99.0 99.0
2010-12-31 22:30:00 0 152 11.9 12.9 99.0 99.0 99.0 999 1012.6 20.6 999.0 19.6 99.0 99.0
2010-12-31 23:00:00 0 150 10.2 10.9 99.0 99.0 99.0 999 1012.8 20.4 999.0 19.6 99.0 99.0

17519 rows × 14 columns

Here we upsample to have data every 15 minutes, but we interpolate to fill in the data between. This is a very useful thing to be able to do.


In [39]:
df2.resample('15T').interpolate()


Out[39]:
mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE
dates
2010-01-01 00:00:00 0.0 134.00 3.000 3.00 99.0 99.0 99.0 999.0 1019.10 13.300 999.0 12.400 99.0 99.0
2010-01-01 00:15:00 0.0 131.25 3.150 3.15 99.0 99.0 99.0 999.0 1019.10 13.375 999.0 12.475 99.0 99.0
2010-01-01 00:30:00 0.0 128.50 3.300 3.30 99.0 99.0 99.0 999.0 1019.10 13.450 999.0 12.550 99.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2010-12-31 22:30:00 0.0 151.00 11.050 11.90 99.0 99.0 99.0 999.0 1012.70 20.500 999.0 19.600 99.0 99.0
2010-12-31 22:45:00 0.0 150.50 10.625 11.40 99.0 99.0 99.0 999.0 1012.75 20.450 999.0 19.600 99.0 99.0
2010-12-31 23:00:00 0.0 150.00 10.200 10.90 99.0 99.0 99.0 999.0 1012.80 20.400 999.0 19.600 99.0 99.0

35037 rows × 14 columns
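The difference between forward-filling and interpolating when upsampling can be sketched on a made-up three-point hourly series:

```python
import pandas as pd

idx = pd.date_range('2010-01-01', periods=3, freq='60min')
s = pd.Series([0.0, 2.0, 4.0], index=idx)

filled = s.resample('30min').ffill()        # repeat the previous value
interp = s.resample('30min').interpolate()  # linear fill between observations
```

ffill gives a step-like series (0, 0, 2, 2, 4), while interpolate gives a smooth ramp (0, 1, 2, 3, 4); which is appropriate depends on what the data represent.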

The codes for time period/frequency are available and are presented here for convenience:

Alias   Description
B   business day frequency
C   custom business day frequency (experimental)
D   calendar day frequency
W   weekly frequency
M   month end frequency
SM  semi-month end frequency (15th and end of month)
BM  business month end frequency
CBM custom business month end frequency
MS  month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS    custom business month start frequency
Q   quarter end frequency
BQ  business quarter end frequency
QS  quarter start frequency
BQS business quarter start frequency
A   year end frequency
BA  business year end frequency
AS  year start frequency
BAS business year start frequency
BH  business hour frequency
H   hourly frequency
T, min  minutely frequency
S   secondly frequency
L, ms   milliseconds
U, us   microseconds
N   nanoseconds

Exercise

We looked at NYC taxi trip distance earlier, but it was hard to tell what was going on with so much data. Resample this high resolution data to be lower resolution so that any trends in the information are easier to see. By what method do you want to do this downsampling? Plot your results.



In [ ]:

groupby and difference between groupby and resampling

groupby allows us to aggregate data across a category or value. We'll use the example of grouping across a measure of time.

Let's examine this further using a dataset of some water properties near the Flower Garden Banks in Texas. We want to find the average salinity by month across the years of data available, that is, we want to know the average salinity value for each month of the year, calculated for each month from all of the years of data available. We will end up with 12 data points in this case.

This is distinct from resampling, where calculating the average salinity by month gives one data point for every month in the time series: with 5 years of data, you would end up with 12*5=60 data points total.
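The contrast can be sketched on two years of made-up daily data (an arbitrary counter stands in for salinity):

```python
import pandas as pd

idx = pd.date_range('2018-01-01', '2019-12-31', freq='1D')
s = pd.Series(range(len(idx)), index=idx)

# groupby: aggregate across years, one value per month of the year
by_month_of_year = s.groupby(s.index.month).mean()

# resample: one value per calendar month in the time series
per_month = s.resample('MS').mean()
```

The groupby result has 12 entries (one per month of the year), while the resample result has 24 (one per month across the two years).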

In the groupby example below, we first read the data into dataframe 'df3', then we group it by month (across years, since there are many years of data). From this grouping, we decide what function we want to apply to all of the numbers we've aggregated across the months of the year. We'll use mean for this example.


In [40]:
df3 = pd.read_table('http://pong.tamu.edu/tabswebsite/daily/tabs_V_salt_all', index_col=0, parse_dates=True)
df3


Out[40]:
WaterT [deg C] Cond [ms/cm] Salinity Density [kg/m^3] SoundVel [m/s]
Dates [UTC]
2006-08-25 00:00:00 32.52 -0.01 0.01 995.36 100.0
2006-08-25 00:30:00 31.25 -0.01 0.01 995.73 -999.0
2006-08-25 01:00:00 30.35 -0.01 0.01 995.99 100.0
... ... ... ... ... ...
2019-04-29 10:00:00 22.97 0.22 0.11 997.89 -999.0
2019-04-29 10:30:00 22.90 0.22 0.11 997.91 -999.0
2019-04-29 11:00:00 23.04 0.22 0.11 997.88 -999.0

149849 rows × 5 columns


In [41]:
ax = df3.groupby(df3.index.month).aggregate(np.mean)['Salinity'].plot(color='k', grid=True, figsize=(14, 4), marker='o')

# the x axis is now showing month of the year, which is what we aggregated over
ax.set_xlabel('Month of year')

ax.set_ylabel('Average salinity')


Out[41]:
Text(0, 0.5, 'Average salinity')

Exercise

Calculate and plot the monthly average of salinity from df3 but find this in time, not across different years (so use resampling).



In [ ]:


Exercise

See how far you can get in this exercise to see how you might access data in real life.

  1. Some of NOAA's data is available really easily online. You can look at the meteorological data from a buoy at a website like this, for buoy 8770475. You can download the data there or look at plots. Check out the website.

  2. You can also directly download data once you know what web address to use to access it. This is really useful when you want to automate downloading data instead of clicking around for it, and it matters more the more data you need to download. You can access the data from this buoy from January 1st to January 14th, 2016, with the following web address. That means you can put this dynamic link directly into a call to pandas to read in the data. Read in this buoy data to a dataframe so that the indices are datetime objects.

     url = 'https://tidesandcurrents.noaa.gov/cgi-bin/newdata.cgi?type=met&id=8770475&begin=20160101&end=20160114&units=metric&timezone=GMT&mode=csv&interval=6'
  3. Now, read in data from buoy 8775237 from October 1 to 9, 2017. What is the url you should use to do this?

  4. After you have the data set read in properly, plot wind speed vs. time with wind speed on the left y-axis, and on the same axes plot wind direction vs. time with wind direction on the right hand y-axis.

Note: where did I get this url from so that I could download the data directly?



In [ ]: