In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 6) # max number of rows to show in this notebook, to save space
import seaborn as sns # for better style in plots
For 1D analysis, we are generally thinking about data that varies in time, that is, time series analysis. The pandas
package is particularly suited to this type of data, having very convenient methods for interpreting, searching through, and using time representations.
Let's begin with the example that opened the class: taxi rides in New York City.
In [2]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', parse_dates=[0, 2], index_col=[0])
What do all these (and other) input keyword arguments do?
parse_dates=[col] or parse_dates=[[col1, col2]]: parse the listed column(s) as dates, to convert them into datetime objects.
index_col=[column integer]: that column will be used as the index instead of the default integer index. This is usually done with the time information for the dataset.
skiprows=[list of rows to skip, numbered from the start of the file with 0], or skiprows=N to skip the first N rows.
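To see these arguments in action on something small, here is a sketch using a tiny made-up CSV (not the taxi file), read from a string:

```python
import io

import pandas as pd

# Hypothetical little CSV, just for illustration: the first line is junk to skip
text = """junk line to skip
time,value
2016-05-01 00:00:00,1.5
2016-05-01 01:00:00,2.5
"""

df_demo = pd.read_csv(io.StringIO(text),
                      skiprows=[0],     # skip the junk first line (rows numbered from 0)
                      parse_dates=[0],  # parse column 0 into datetime objects
                      index_col=[0])    # use column 0 as the index
print(df_demo.index.dtype)  # datetime64[ns]
```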
We can check to make sure the date/time information has been read in as the index, which allows us to reference the other columns using this time information really easily:
In [3]:
df.index
Out[3]:
From this we see that the index is indeed using the timing information in the file, and we can see that the dtype
is datetime64[ns]
.
We can now access the columns of the file using dictionary-like keyword arguments, like so:
In [4]:
df['trip_distance']
Out[4]:
We can equivalently access the columns of data as if they are methods. This means that we can use tab autocomplete to see methods and data available in a dataframe.
In [5]:
df.trip_distance
Out[5]:
We can plot in this way, too:
In [6]:
df['trip_distance'].plot(figsize=(14,6))
Out[6]:
One of the biggest benefits of using pandas
is being able to easily reference the data in intuitive ways. For example, because we set up the index of the dataframe to be the date and time, we can pull out data using dates. In the following, we pull out all data from the first hour of the day:
In [7]:
df['2016-05-01 00']
Out[7]:
Here we further subdivide to examine the passenger count during that time period:
In [8]:
df['passenger_count']['2016-05-01 00']
Out[8]:
We can also access a range of data, for example any data rows from midnight until noon:
In [9]:
df['2016-05-01 00':'2016-05-01 11']
Out[9]:
In [10]:
df['2016-05-01 00:30']
However, we can use another approach to have more control: .loc
lets us access combinations of specific columns and/or rows, or subsets of columns and/or rows.
In [11]:
df.loc['2016-05-01 00:30']
Out[11]:
You can also select a specific row and column at once, using the pattern df.loc[row_label, col_label]
:
In [12]:
df.loc['2016-05-01 00:30', 'passenger_count']
Out[12]:
You can select more than one column:
In [13]:
df.loc['2016-05-01 00:30', ['passenger_count','trip_distance']]
Out[13]:
You can select a range of data:
In [14]:
df.loc['2016-05-01 00:30':'2016-05-01 01:30', ['passenger_count','trip_distance']]
Out[14]:
You can alternatively select data by index instead of by label, using iloc
instead of loc
. Here we select the first 5 rows of data for all columns:
In [15]:
df.iloc[0:5, :]
Out[15]:
Access the data from dataframe df
for the last three hours of the day at once. Plot the tip amount (tip_amount
) for this time period. After you can make a line plot, try making a histogram of the data. Play around with the data range and the number of bins. A number of plot
types are available built-in to a pandas
dataframe inside the plot
method under the keyword argument kind
.
In [ ]:
In [ ]:
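If you get stuck, here is one possible approach, sketched on a synthetic stand-in dataframe (made-up numbers; the real dataframe is df
with its tip_amount
column):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the taxi dataframe: one made-up tip per minute for the day
idx = pd.date_range('2016-05-01 00:00', '2016-05-01 23:59', freq='1min')
toy = pd.DataFrame({'tip_amount': np.random.default_rng(0).gamma(2.0, 1.0, len(idx))},
                   index=idx)

# The last three hours of the day via partial-string slicing on the index
last3 = toy['2016-05-01 21':'2016-05-01 23']

ax = last3['tip_amount'].plot()                      # line plot
ax = last3['tip_amount'].plot(kind='hist', bins=20)  # histogram via the kind= keyword
```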
You can change the format of datetimes using strftime()
.
Compare the datetimes in our dataframe index in the first cell below with the second cell, in which we format the look of the datetimes differently. We can choose how it looks using formatting codes. You can find a comprehensive list of the formatting directives at http://strftime.org/. Note that inside the parentheses, you can write other characters that will be passed through (like the comma in the example below).
In [16]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', parse_dates=[0, 2], index_col=[0])
df.index
Out[16]:
In [17]:
df.index.strftime('%b %d, %Y %H:%M')
Out[17]:
You can create and use datetimes using pandas
. It will interpret the information you put into a string as best it can. Year-month-day is a good way to put in dates instead of using either American or European-specific ordering.
After defining a pandas Timestamp, you can also change time using Timedelta.
In [18]:
now = pd.Timestamp('October 22, 2019 1:19PM')
now
Out[18]:
In [19]:
tomorrow = pd.Timedelta('1 day')
now + tomorrow
Out[19]:
You can set up a range of datetimes to make your own dataframe indices with the following. Codes for the frequency (like '15T' for 15 minutes) are listed in a table later in this notebook.
In [20]:
pd.date_range(start='Jan 1 2019', end='May 1 2019', freq='15T')
Out[20]:
Note that you can get many different measures of your time index.
In [21]:
df.index.minute
Out[21]:
In [22]:
df.index.dayofweek
Out[22]:
How would you change the call to strftime
above to format all of the indices such that the first index, for example, would read "the 1st of May, 2016 at the hour of 00 and the minute of 00 and the seconds of 00, which is the following day of the week: Sunday"? Use the format codes for as many of the values as possible.
In [ ]:
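One possible answer, as a sketch; note that strftime
has no code for ordinal suffixes, so %d gives "01" rather than "1st":

```python
import pandas as pd

# Format string built from the directives at http://strftime.org/
fmt = ('the %d of %B, %Y at the hour of %H and the minute of %M and the seconds of %S, '
       'which is the following day of the week: %A')
first = pd.Timestamp('2016-05-01 00:00:00')
print(first.strftime(fmt))
# the 01 of May, 2016 at the hour of 00 and the minute of 00 and the seconds of 00,
# which is the following day of the week: Sunday
```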
In [23]:
df['tip squared'] = df.tip_amount**2 # making up some numbers to save to a new column
df['tip squared'].plot()
Out[23]:
In [24]:
df2 = pd.read_table('../data/burl1h2010.txt', header=0, skiprows=[1], delim_whitespace=True,
parse_dates={'dates': ['#YY', 'MM', 'DD', 'hh']}, index_col=0)
df2
Out[24]:
In [25]:
df2.index
Out[25]:
In [26]:
df.plot?
You can mix and match plotting with matplotlib: either set up the figure and axes you want to use and pass them into the plot
calls from your dataframe, or start with a pandas plot and save the axes from that call. Each will be demonstrated next. Alternatively, you can bring the pandas data to matplotlib fully.
Start with matplotlib, then input axes to pandas
To demonstrate plotting starting from matplotlib
, we will also make a note about column selection for plotting: you can select which data columns to plot either in the line before the plot
call, or you can choose the columns within the plot call.
The key part here is that you input to your pandas plot call the axes you wanted plotted into (here: ax=axes[0]
).
In [27]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14,4))
df2['WSPD']['2010-5'].plot(ax=axes[0])
df2.loc['2010-5'].plot(y='WSPD', ax=axes[1])
Out[27]:
In [28]:
ax = df2['WSPD']['2010 11 1'].plot()
ax.set_ylabel('Wind speed')
Out[28]:
In [29]:
plt.plot(df2['WSPD'])
Out[29]:
In [30]:
# all
df2.plot()
Out[30]:
To plot more than one but less than all columns, give a list of column names. Here are two ways to do the same thing:
In [31]:
# multiple
fig, axes = plt.subplots(1, 2, figsize=(14,4))
df2[['WSPD', 'GST']].plot(ax=axes[0])
df2.plot(y=['WSPD', 'GST'], ax=axes[1])
Out[31]:
You can control how datetimes look on the x axis in these plots as demonstrated in this section. The formatting codes used in the call to DateFormatter
are the same as those used above in this notebook for strftime
.
Note that you can also control all of this with minor ticks additionally.
In [32]:
ax = df2['WSPD'].plot(figsize=(14,4))
In [33]:
from matplotlib.dates import DateFormatter
ax = df2['WSPD'].plot(figsize=(14,4))
ax.set_xlabel('2010')
date_form = DateFormatter("%b %d")
ax.xaxis.set_major_formatter(date_form)
# import matplotlib.dates as mdates
# # You can also control where the ticks are located, by date with Locators
# ticklocations = mdates.MonthLocator()
# ax.xaxis.set_major_locator(ticklocations)
In [34]:
axleft = df2['WSPD']['2010-10'].plot(figsize=(14,4))
axright = df2['WDIR']['2010-10'].plot(secondary_y=True, alpha=0.5)
axleft.set_ylabel('Speed [m/s]', color='blue');
axright.set_ylabel('Dir [degrees]', color='orange');
Sometimes we want our data to be at a different sampling frequency than we have; that is, we want to change the time between rows or observations. Changing this is called resampling. We can upsample to increase the number of data points in a given dataset (or decrease the period between points), or we can downsample to decrease the number of data points.
The wind data is given every hour. Here we downsample it to be once a day instead. After the resample
call, an aggregation method needs to be chosen to specify how the existing data are combined over each downsampling period. We could use the max value over the 1-day period to represent each day:
In [35]:
df2.resample('1d').max() #['DEWP'] # now the data is daily
Out[35]:
It's always important to check our results to make sure they look reasonable. Let's plot our resampled data with the original data to make sure they align well. We'll choose one variable for this check.
We can see that the daily max wind gust does indeed look like the max value for each day, though note that it is plotted at the start of the day.
In [36]:
df2['GST']['2010-4-1':'2010-4-5'].plot()
df2.resample('1d').max()['GST']['2010-4-1':'2010-4-5'].plot()
Out[36]:
We can also upsample our data or add more rows of data. Note that like before, after we resample our data we still need a method on the end telling pandas
how to process the data. However, since in this case we are not combining data (downsampling) but are adding more rows (upsampling), using a function like max
doesn't change the existing observations (taking the max of a single row). For the new rows, we haven't said how to fill them so they are nan's by default.
Here we are changing from having data every hour to having it every 30 minutes.
In [37]:
df2.resample('30min').max() # max doesn't say what to do with data in new rows
Out[37]:
When upsampling, a reasonable option is to fill the new rows with data from the previous existing row:
In [38]:
df2.resample('30min').ffill()
Out[38]:
Here we upsample to have data every 15 minutes, but we interpolate to fill in the data between. This is a very useful thing to be able to do.
In [39]:
df2.resample('15T').interpolate()
Out[39]:
The codes for time period/frequency are available and are presented here for convenience:
Alias Description
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
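A few of these aliases in action with date_range
(both endpoints are included in the counts):

```python
import pandas as pd

# One day of hourly points: 24 hours plus both endpoints -> 25
print(len(pd.date_range('2019-01-01', '2019-01-02', freq='H')))    # 25
# One day of 15-minute points: 96 intervals plus the endpoint -> 97
print(len(pd.date_range('2019-01-01', '2019-01-02', freq='15T')))  # 97
# Month starts across one year -> 12
print(len(pd.date_range('2019-01-01', '2019-12-31', freq='MS')))   # 12
```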
In [ ]:
groupby, and the difference between groupby and resampling
groupby
allows us to aggregate data across a category or value. We'll use the example of grouping across a measure of time.
Let's examine this further using a dataset of some water properties near the Flower Garden Banks in Texas. We want to find the average salinity by month across the years of data available, that is, we want to know the average salinity value for each month of the year, calculated for each month from all of the years of data available. We will end up with 12 data points in this case.
This is distinct from resampling for which if you calculate the average salinity by month, you will get a data point for each month in the time series. If there are 5 years of data in your dataset, you will end up with 12*5=60 data points total.
In the groupby
example below, we first read the data into dataframe 'df3', then we group it by month (across years, since there are many years of data). From this grouping, we decide what function we want to apply to all of the numbers we've aggregated across the months of the year. We'll use mean for this example.
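The distinction can be sketched with a made-up 5-year daily series (synthetic numbers, not the salinity data):

```python
import numpy as np
import pandas as pd

# Hypothetical 5 years of daily data
idx = pd.date_range('2014-01-01', '2018-12-31', freq='D')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

by_month_of_year = s.groupby(s.index.month).mean()  # aggregates each calendar month across years
monthly_means = s.resample('MS').mean()             # one value per month in the record

print(len(by_month_of_year))  # 12
print(len(monthly_means))     # 60, i.e. 12 months x 5 years
```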
In [40]:
df3 = pd.read_table('http://pong.tamu.edu/tabswebsite/daily/tabs_V_salt_all', index_col=0, parse_dates=True)
df3
Out[40]:
In [41]:
ax = df3.groupby(df3.index.month)['Salinity'].mean().plot(color='k', grid=True, figsize=(14, 4), marker='o')
# the x axis is now showing month of the year, which is what we aggregated over
ax.set_xlabel('Month of year')
ax.set_ylabel('Average salinity')
Out[41]:
In [ ]:
See how far you can get in this exercise to see how you might access data in real life.
Some of NOAA's data is available really easily online. You can look at the meteorological data from a buoy at a website like this, for buoy 8770475. You can download the data there or look at plots. Check out the website.
You can also directly download data once you know what web address to use to access that data. This can be really useful when you want to automate the process of downloading data instead of having to click around for it, and it matters more the more downloading you want to do. You can access the data from this buoy from January 1st to January 14th, 2016, with the following web address. That means that you can put this dynamic link directly into a call with pandas
to read in data. Read in this buoy data to a dataframe so that the indices are datetime objects.
url = 'https://tidesandcurrents.noaa.gov/cgi-bin/newdata.cgi?type=met&id=8770475&begin=20160101&end=20160114&units=metric&timezone=GMT&mode=csv&interval=6'
Now, read in data from buoy 8775237 from October 1 to 9, 2017. What is the url you should use to do this?
After you have the data set read in properly, plot wind speed vs. time with wind speed on the left y-axis, and on the same axes plot wind direction vs. time with wind direction on the right hand y-axis.
Note: where did I get this url from so that I could download the data directly?
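Because the query parameters in the link are regular, one way to build such addresses is programmatically; make_noaa_url
below is a hypothetical helper written for this notebook, not a NOAA API:

```python
def make_noaa_url(station, begin, end):
    # Assemble the download link; the parameter names come from the example link above
    return ('https://tidesandcurrents.noaa.gov/cgi-bin/newdata.cgi'
            f'?type=met&id={station}&begin={begin}&end={end}'
            '&units=metric&timezone=GMT&mode=csv&interval=6')

url = make_noaa_url(8770475, 20160101, 20160114)  # reproduces the example link
# A dataframe could then be read with pd.read_csv(url, ...) (network access required)
```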
In [ ]: