In [2]:
# Loading the libraries
import pandas as pd
import numpy as np
import time
import datetime
The best way to understand how to handle dates with Python is to use an example. Here we will use a very small dataset with only 10 observations. The first column shows the date and the second shows the temperature in celsius degrees for that day.
In [32]:
# Loading the dataset
data = pd.read_csv('parse_date.csv', sep=',')
# Show the head - this is actually the entire data set.
data.head(10)
Out[32]:
In [36]:
# Selecting the cell on the first row and the first column
print data.iloc[0,0]
# Printing the type of this cell
type(data.iloc[0,0])
Out[36]:
We see here that when loaded into our dataframe, the values in the 'date' column are considered as strings and not as real dates. The first step to take is therefore to convert these strings into datetime objects that we can easily manipulate. Fortunately, Python has the very useful strptime(data, format) method that does exactly what we want. Let's see how this works. Note that the strptime() method is part of the datetime class of the datetime module, hence the datetime.datetime.strptime().
In [9]:
# Let's take only one date value in our table and convert it into a datatime object
date = datetime.datetime.strptime(data.iloc[0,0],'%d/%m/%Y').date()
# Print the new datetime object
print date
# Check the type of the converted value
print type(date)
Success! Our string is now a datetime object that we can format it as we wish. Now let's convert all the date strings into datetime objects.
In [12]:
# Create a function that will transform all the strings into dates.
def parse_date(raw_date):
return datetime.datetime.strptime(raw_date, '%d/%m/%Y').date()
In [13]:
# Let's apply our format to all the dates in our DataFrame
data['date'] = data['date'].map(parse_date)
Word of caution: Whenever you define your function, make sure that the date format you specify in the strptime() method is the same as the one in the date string. In our case, the string ('01/01/2013') has the following format: it starts with the day of the month %d ('01'), followed by a '/', followed by the month %m (01), another '/' and ends up with the full year %Y (2013).
To learn more about the different formats, check this link.
In [14]:
# Let's have a look at the whold dataset
data.head()
Out[14]:
In [15]:
type(data.iloc[0,0])
Out[15]:
Now that our dates are properly transformed our strings into date objects we can manipulate these dates as much as we want. One thing that is often required to change the format of the date. Python has another method called strftime(format) (yes with an f) that does the opposite of the strptime() method by converting a datetime object into a string. Let's see how we can convert our dates with the strftime() method.
In [16]:
# We want to have the complete date - See the new column "Long date" for the result
data['Long Date'] = data['date'].map(lambda x: x.strftime('%A, %B %d, %Y'))
data.head()
Out[16]:
In [46]:
# Now we want to just see a short date - see the "Short date" column
data['Short Date'] = data['date'].map(lambda x: x.strftime('%b %d, %Y'))
data.head()
Out[46]:
Sometimes you may want to analyze the difference between days of the week. Maybe for instance you want to know if there is more traffic on Mondays than on Sundays. One cool trick is to convert the dates into their corresponding weekdays.
In [47]:
# Add the weekday column to our dataframe
data['weekday'] = data['date'].map(lambda x: x.strftime('%A'))
data.head()
Out[47]:
Similar to the weekday, you can now analyze the time series by looking at the difference between months.
In [48]:
# Add the month column to our dataframe
data['month'] = data['date'].map(lambda x: x.strftime('%B'))
data.head()
Out[48]:
The configurations are endless, here are some other examples of format that you may want to use.
Pandas is also extremely powerful with time series and you can use it to easily convert, generate and index dates. One difference to have in mind when using Pandas is the one between a timestamp and a time span. A timestamp refers to a specific point in time while a time span is a period. Let's explore this difference in more details.
In [33]:
# Create a timestamp using Timestamp
timestamps = pd.Timestamp('2014-01-01')
print timestamps
With time spans you can create any type of period: Hour, Day, Week, Month, Quarter...
In [24]:
# An hour
hour = pd.Period('2014-01', freq='H')
print hour
In [25]:
# A week
week = pd.Period('2014-01', freq='W')
print week
In [27]:
# A month
month = pd.Period('2014-01', freq='M')
print month
In [29]:
# A quarter
quarter = pd.Period('2014-01', freq='Q')
print quarter
With Pandas you can also convert strings into dates using the to_datetime() method. Let's try with our dataset.
In [59]:
# Converting dates to timestamps
timestamp = pd.to_datetime(data['date'])
print timestamp
You might want to you use the dates as index for your dataframe but if you look at the cell above, converting the dates with to_datetime() will not do this automatically. To use the dates as index, you will need one extra step. You can either use the DateTimeIndex() - for timestamps - or PeriodIndex() method - for time spans.
In [64]:
# Array of timestamps used as index
date_index = pd.DatetimeIndex(timestamp, freq='D')
print date_index
# Array of periods used as index
period_index = pd.PeriodIndex(timestamp, freq= 'D')
print period_index
Please note that the offset aliases are different between Python and Pandas. To see the complete list of aliases please check the following link
Until now we have worked with dates extracted from a csv file but what if we want to generate a new time series? Pandas has of course some handful tools for this too. Use the method date_range(). With this method you can either specify the start date and the number of periods you need or specify both the start and the end dates. Both methods are presented in the cell below.
In [36]:
# You specify the number of periods - here 24 months
rng = pd.date_range('1/1/2014', periods=24, freq='M')
# You can also specify the begining and the end of your data series
start = '1/1/2013'
end = '1/1/2014'
rng = pd.date_range(start, end, freq='M')
If you want to generate a time series of random values and use your dates as index, follow these 2 steps:
In [3]:
rng = pd.date_range('1/1/2014', periods=24, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()
Out[3]:
Now that you have generated your time series, you want to select only the first 3 months or the last 3. For this you can use first(period) or last(period) methods. Between the brackets, you specify the number of periods that you want and the type of periods (hour, week, month,...).
In [8]:
#We want to select the 5 first months
ts.first('5M')
Out[8]:
In [9]:
# Select the last 3 months
ts.last('3M')
Out[9]:
You can also select specific period of time from your time series. For this I would recommend to use a mask.
In [6]:
# Selecting the data from january only
mask = (ts.index.month ==1)
january_data = ts[mask]
january_data
Out[6]:
In [7]:
# Selecting the data from the first quarter
mask = (ts.index.month >= 1) & (ts.index.month <=3)
first_quarter_data = ts[mask]
first_quarter_data
Out[7]:
In [8]:
from IPython.core.display import HTML
def css_styling():
styles = open("styles/custom.css", "r").read()
return HTML(styles)
css_styling()
Out[8]: