In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 6) # max number of rows to show in this notebook, to save space
import seaborn as sns # for better style in plots
For 1D analysis, we are generally thinking about data that varies in time, that is, time series analysis. The pandas
package is particularly suited to this type of data, having very convenient methods for interpreting, searching through, and using time representations.
Let's begin with the example that opened the class: taxi rides in New York City.
In [2]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', parse_dates=[0, 2], index_col=[0])
What do all these (and other) input keyword arguments do?
parse_dates=[col] or parse_dates=[[col1, col2]]: parse the listed column(s) as dates, to convert them into datetime objects.
index_col=[column integer]: that column will be used as the index instead of the default integer index. This is usually done with the time information for the dataset.
skiprows=[list of rows to skip, numbered from the start of the file with 0], or skiprows=N to skip the first N rows.
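To see these arguments in action on something small, here is a sketch using a tiny made-up CSV (not the taxi file), read from a string:

```python
import io

import pandas as pd

# Hypothetical little CSV, just for illustration: the first line is junk to skip
text = """junk line to skip
time,value
2016-05-01 00:00:00,1.5
2016-05-01 01:00:00,2.5
"""

df_demo = pd.read_csv(io.StringIO(text),
                      skiprows=[0],     # skip the junk first line (rows numbered from 0)
                      parse_dates=[0],  # parse column 0 into datetime objects
                      index_col=[0])    # use column 0 as the index
print(df_demo.index.dtype)  # datetime64[ns]
```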
We can check to make sure the date/time information has been read in as the index, which allows us to reference the other columns using this time information really easily:
In [3]:
df.index
Out[3]:
From this we see that the index is indeed using the timing information in the file, and we can see that the dtype
is datetime64[ns]
.
We can now access the columns of the file using dictionary-like keyword arguments, like so:
In [4]:
df['trip_distance']
Out[4]:
We can equivalently access the columns of data as if they are methods. This means that we can use tab autocomplete to see methods and data available in a dataframe.
In [5]:
df.trip_distance
Out[5]:
We can plot in this way, too:
In [6]:
df['trip_distance'].plot(figsize=(14,6))
Out[6]:
One of the biggest benefits of using pandas
is being able to easily reference the data in intuitive ways. For example, because we set up the index of the dataframe to be the date and time, we can pull out data using dates. In the following, we pull out all data from the first hour of the day:
In [7]:
df['2016-05-01 00']
Out[7]:
Here we further subdivide to examine the passenger count during that time period:
In [8]:
df['passenger_count']['2016-05-01 00']
Out[8]:
We can also access a range of data, for example any data rows from midnight until noon:
In [9]:
df['2016-05-01 00':'2016-05-01 11']
Out[9]:
In [10]:
df['2016-05-01 00:30']
However, we can use another approach to have more control: .loc
lets us access combinations of specific columns and/or rows, or subsets of columns and/or rows.
In [11]:
df.loc['2016-05-01 00:30']
Out[11]:
You can also select a specific row and column at once, using the pattern df.loc[row_label, col_label]
:
In [12]:
df.loc['2016-05-01 00:30', 'passenger_count']
Out[12]:
You can select more than one column:
In [13]:
df.loc['2016-05-01 00:30', ['passenger_count','trip_distance']]
Out[13]:
You can select a range of data:
In [14]:
df.loc['2016-05-01 00:30':'2016-05-01 01:30', ['passenger_count','trip_distance']]
Out[14]:
You can alternatively select data by index instead of by label, using iloc
instead of loc
. Here we select the first 5 rows of data for all columns:
In [15]:
df.iloc[0:5, :]
Out[15]:
Access the data from dataframe df
for the last three hours of the day at once. Plot the tip amount (tip_amount
) for this time period. After you can make a line plot, try making a histogram of the data. Play around with the data range and the number of bins. A number of plot
types are available built-in to a pandas
dataframe inside the plot
method under the keyword argument kind
.
In [ ]:
In [ ]:
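If you get stuck, here is one possible approach, sketched on a synthetic stand-in dataframe (made-up numbers; the real dataframe is df
with its tip_amount
column):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the taxi dataframe: one made-up tip per minute for the day
idx = pd.date_range('2016-05-01 00:00', '2016-05-01 23:59', freq='1min')
toy = pd.DataFrame({'tip_amount': np.random.default_rng(0).gamma(2.0, 1.0, len(idx))},
                   index=idx)

# The last three hours of the day via partial-string slicing on the index
last3 = toy['2016-05-01 21':'2016-05-01 23']

ax = last3['tip_amount'].plot()                      # line plot
ax = last3['tip_amount'].plot(kind='hist', bins=20)  # histogram via the kind= keyword
```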
You can change the format of datetimes using strftime()
.
Compare the datetimes in our dataframe index in the first cell below with the second cell, in which we format the look of the datetimes differently. We can choose how it looks using formatting codes. You can find a comprehensive list of the formatting directives at http://strftime.org/. Note that inside the parentheses, you can write other characters that will be passed through (like the comma in the example below).
In [16]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', parse_dates=[0, 2], index_col=[0])
df.index
Out[16]:
In [17]:
df.index.strftime('%b %d, %Y %H:%M')
Out[17]:
You can create and use datetimes using pandas
. It will interpret the information you put into a string as best it can. Year-month-day is a good way to put in dates instead of using either American or European-specific ordering.
After defining a pandas Timestamp, you can also change time using Timedelta.
In [18]:
now = pd.Timestamp('October 22, 2019 1:19PM')
now
Out[18]:
In [19]:
tomorrow = pd.Timedelta('1 day')
now + tomorrow
Out[19]:
You can set up a range of datetimes to make your own dataframe indices with the following. Codes for the frequency (like '15T' for 15 minutes) are listed in a table later in this notebook.
In [20]:
pd.date_range(start='Jan 1 2019', end='May 1 2019', freq='15T')
Out[20]:
Note that you can get many different measures of your time index.
In [21]:
df.index.minute
Out[21]:
In [22]:
df.index.dayofweek
Out[22]:
How would you change the call to strftime
above to format all of the indices such that the first index, for example, would read "the 1st of May, 2016 at the hour of 00 and the minute of 00 and the seconds of 00, which is the following day of the week: Sunday"? Use the format codes for as many of the values as possible.
In [ ]:
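One possible answer, as a sketch; note that strftime
has no code for ordinal suffixes, so %d gives "01" rather than "1st":

```python
import pandas as pd

# Format string built from the directives at http://strftime.org/
fmt = ('the %d of %B, %Y at the hour of %H and the minute of %M and the seconds of %S, '
       'which is the following day of the week: %A')
first = pd.Timestamp('2016-05-01 00:00:00')
print(first.strftime(fmt))
# the 01 of May, 2016 at the hour of 00 and the minute of 00 and the seconds of 00,
# which is the following day of the week: Sunday
```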
In [23]:
df['tip squared'] = df.tip_amount**2 # making up some numbers to save to a new column
df['tip squared'].plot()
Out[23]:
In [24]:
df2 = pd.read_table('../data/burl1h2010.txt', header=0, skiprows=[1], delim_whitespace=True,
parse_dates={'dates': ['#YY', 'MM', 'DD', 'hh']}, index_col=0)
df2
Out[24]:
In [25]:
df2.index
Out[25]:
In [26]:
df.plot?
You can mix and match plotting with matplotlib: either set up the figure and axes you want to use and pass them into the plot
calls from your dataframe, or start with a pandas plot and save the axes from that call. Each will be demonstrated next. Alternatively, you can bring the pandas data to matplotlib fully.
Start with matplotlib, then input axes to pandas
To demonstrate plotting starting from matplotlib
, we will also make a note about column selection for plotting: you can select which data columns to plot either in the line before the plot
call, or you can choose the columns within the plot call.
The key part here is that you input to your pandas plot call the axes you wanted plotted into (here: ax=axes[0]
).
In [27]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14,4))
df2['WSPD']['2010-5'].plot(ax=axes[0])
df2.loc['2010-5'].plot(y='WSPD', ax=axes[1])
Out[27]:
In [28]:
ax = df2['WSPD']['2010 11 1'].plot()
ax.set_ylabel('Wind speed')
Out[28]:
In [29]:
plt.plot(df2['WSPD'])
Out[29]:
In [30]:
# all
df2.plot()
Out[30]:
To plot more than one but less than all columns, give a list of column names. Here are two ways to do the same thing:
In [31]:
# multiple
fig, axes = plt.subplots(1, 2, figsize=(14,4))
df2[['WSPD', 'GST']].plot(ax=axes[0])
df2.plot(y=['WSPD', 'GST'], ax=axes[1])
Out[31]:
You can control how datetimes look on the x axis in these plots as demonstrated in this section. The formatting codes used in the call to DateFormatter
are the same as those used above in this notebook for strftime
.
Note that you can also control all of this with minor ticks additionally.
In [32]:
ax = df2['WSPD'].plot(figsize=(14,4))
In [33]:
from matplotlib.dates import DateFormatter
ax = df2['WSPD'].plot(figsize=(14,4))
ax.set_xlabel('2010')
date_form = DateFormatter("%b %d")
ax.xaxis.set_major_formatter(date_form)
# import matplotlib.dates as mdates
# # You can also control where the ticks are located, by date with Locators
# ticklocations = mdates.MonthLocator()
# ax.xaxis.set_major_locator(ticklocations)
In [34]:
axleft = df2['WSPD']['2010-10'].plot(figsize=(14,4))
axright = df2['WDIR']['2010-10'].plot(secondary_y=True, alpha=0.5)
axleft.set_ylabel('Speed [m/s]', color='blue');
axright.set_ylabel('Dir [degrees]', color='orange');
Sometimes we want our data to be at a different sampling frequency than we have; that is, we want to change the time between rows or observations. Changing this is called resampling. We can upsample to increase the number of data points in a given dataset (or decrease the period between points), or we can downsample to decrease the number of data points.
The wind data is given every hour. Here we downsample it to be once a day instead. After the resample
call, an aggregation method needs to be chosen to specify how the existing data are combined over each downsampling period. We could use the max value over the 1-day period to represent each day:
In [35]:
df2.resample('1d').max() #['DEWP'] # now the data is daily
Out[35]:
It's always important to check our results to make sure they look reasonable. Let's plot our resampled data with the original data to make sure they align well. We'll choose one variable for this check.
We can see that the daily max wind gust does indeed look like the max value for each day, though note that it is plotted at the start of the day.
In [36]:
df2['GST']['2010-4-1':'2010-4-5'].plot()
df2.resample('1d').max()['GST']['2010-4-1':'2010-4-5'].plot()
Out[36]:
We can also upsample our data or add more rows of data. Note that like before, after we resample our data we still need a method on the end telling pandas
how to process the data. However, since in this case we are not combining data (downsampling) but are adding more rows (upsampling), using a function like max
doesn't change the existing observations (taking the max of a single row). For the new rows, we haven't said how to fill them so they are nan's by default.
Here we are changing from having data every hour to having it every 30 minutes.
In [37]:
df2.resample('30min').max() # max doesn't say what to do with data in new rows
Out[37]:
When upsampling, a reasonable option is to fill the new rows with data from the previous existing row:
In [38]:
df2.resample('30min').ffill()
Out[38]:
Here we upsample to have data every 15 minutes, but we interpolate to fill in the data between. This is a very useful thing to be able to do.
In [39]:
df2.resample('15T').interpolate()
Out[39]:
The codes for time period/frequency are available and are presented here for convenience:
Alias Description
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
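A few of these aliases in action with date_range
(both endpoints are included in the counts):

```python
import pandas as pd

# One day of hourly points: 24 hours plus both endpoints -> 25
print(len(pd.date_range('2019-01-01', '2019-01-02', freq='H')))    # 25
# One day of 15-minute points: 96 intervals plus the endpoint -> 97
print(len(pd.date_range('2019-01-01', '2019-01-02', freq='15T')))  # 97
# Month starts across one year -> 12
print(len(pd.date_range('2019-01-01', '2019-12-31', freq='MS')))   # 12
```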
In [ ]:
groupby, and the difference between groupby and resampling
groupby
allows us to aggregate data across a category or value. We'll use the example of grouping across a measure of time.
Let's examine this further using a dataset of some water properties near the Flower Garden Banks in Texas. We want to find the average salinity by month across the years of data available, that is, we want to know the average salinity value for each month of the year, calculated for each month from all of the years of data available. We will end up with 12 data points in this case.
This is distinct from resampling for which if you calculate the average salinity by month, you will get a data point for each month in the time series. If there are 5 years of data in your dataset, you will end up with 12*5=60 data points total.
In the groupby
example below, we first read the data into dataframe 'df3', then we group it by month (across years, since there are many years of data). From this grouping, we decide what function we want to apply to all of the numbers we've aggregated across the months of the year. We'll use mean for this example.
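The distinction can be sketched with a made-up 5-year daily series (synthetic numbers, not the salinity data):

```python
import numpy as np
import pandas as pd

# Hypothetical 5 years of daily data
idx = pd.date_range('2014-01-01', '2018-12-31', freq='D')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

by_month_of_year = s.groupby(s.index.month).mean()  # aggregates each calendar month across years
monthly_means = s.resample('MS').mean()             # one value per month in the record

print(len(by_month_of_year))  # 12
print(len(monthly_means))     # 60, i.e. 12 months x 5 years
```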
In [40]:
df3 = pd.read_table('http://pong.tamu.edu/tabswebsite/daily/tabs_V_salt_all', index_col=0, parse_dates=True)
df3
Out[40]:
In [41]:
ax = df3.groupby(df3.index.month)['Salinity'].mean().plot(color='k', grid=True, figsize=(14, 4), marker='o')
# the x axis is now showing month of the year, which is what we aggregated over
ax.set_xlabel('Month of year')
ax.set_ylabel('Average salinity')
Out[41]:
In [ ]:
See how far you can get in this exercise to see how you might access data in real life.
Some of NOAA's data is available really easily online. You can look at the meteorological data from a buoy at a website like this, for buoy 8770475. You can download the data there or look at plots. Check out the website.
You can also directly download data once you know what web address to use to access that data. This can be really useful when you want to automate the process of downloading data instead of having to click around for it, and it matters more the more downloading you want to do. You can access the data from this buoy from January 1st to January 14th, 2016, with the following web address. That means that you can put this dynamic link directly into a call with pandas
to read in data. Read in this buoy data to a dataframe so that the indices are datetime objects.
url = 'https://tidesandcurrents.noaa.gov/cgi-bin/newdata.cgi?type=met&id=8770475&begin=20160101&end=20160114&units=metric&timezone=GMT&mode=csv&interval=6'
Now, read in data from buoy 8775237 from October 1 to 9, 2017. What is the url you should use to do this?
After you have the data set read in properly, plot wind speed vs. time with wind speed on the left y-axis, and on the same axes plot wind direction vs. time with wind direction on the right hand y-axis.
Note: where did I get this url from so that I could download the data directly?
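Because the query parameters in the link are regular, one way to build such addresses is programmatically; make_noaa_url
below is a hypothetical helper written for this notebook, not a NOAA API:

```python
def make_noaa_url(station, begin, end):
    # Assemble the download link; the parameter names come from the example link above
    return ('https://tidesandcurrents.noaa.gov/cgi-bin/newdata.cgi'
            f'?type=met&id={station}&begin={begin}&end={end}'
            '&units=metric&timezone=GMT&mode=csv&interval=6')

url = make_noaa_url(8770475, 20160101, 20160114)  # reproduces the example link
# A dataframe could then be read with pd.read_csv(url, ...) (network access required)
```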
In [ ]: