This data describes the consumption of coffee at the espresso machine in the Centre for Complexity Science, at the University of Warwick. The data collection exercise was facilitated by an Android app which automatically timestamped incoming data and added the submitting user's name. In theory, whenever contributors ( thanks, guys ! ) had a coffee, they would query the machine as to how many coffees it had made to date, and submit the count via the Android app. In practice, the small number of contributors means that the physical process is poorly and irregularly sampled, full of holes, and not of the best quality. Concisely put, this is real data, my worst nightmare. Hey, at least it's strictly monotonic...
The purpose of this notebook is to perform some basic time series analysis on the data using primarily the functionality of Numpy, Scipy, and Matplotlib. A few other cool things might be thrown in for fun.
Go to the documentation for any function you're using. The docs are great, and contain examples as well as the full call signatures.
Below is a list of suggested exercises, with a little guidance on how to do them; below that are my solutions. So, a list of things to try :
Reading the Data
FASTEST : Using Pickle
with open("pickled_coffee.p") as f :
    times, coffees, names, day_of_week, time_seconds, hour_of_day = pickle.load(f)
Note the indent ! Each variable now holds one column of the .csv. In the code that follows, you'll need to replace the Pandas object with the relevant variable. Here, time_seconds is the timestamp of the datapoint, in seconds from 1 Jan, 1970, 00:00 GMT ( the Epoch ).
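If you go down this route, here is a minimal sketch ( assuming the variable names above ) of converting time_seconds into the day-based time axis t that the plots below use :

import numpy as np

# time_seconds comes straight from the pickle load above;
# dividing by 86400 ( seconds per day ) gives time in days since the Epoch
t = np.array(time_seconds) / 86400.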
BEST : Using Pandas
- read_csv(filename), where filename is a string. This will generate a Pandas DataFrame object.
- The .columns variable gives a list of column names, or simply call .head() for a preview.
- .values returns a Numpy array.
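A minimal sketch of this route, assuming the same CoffeeTimeSeries.csv file that is read further down :

import pandas as pd

data = pd.read_csv("CoffeeTimeSeries.csv")
print data.columns      # list of column names
print data.head()       # preview of the first few rows
values = data.values    # the whole table as a Numpy array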
FOR MASOCHISTS : Using native Python, one method
- myfile = open(filename) opens the file.
- for line in myfile iterates over the lines, each of which will give you a string.
- The .split(",") method splits the line at each comma.
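A rough sketch of this approach; the column order ( Timestamp, Coffees, Name ) is assumed here, so check it against the header row of your file :

timestamps, coffee_counts, contributor_names = [], [], []
with open("CoffeeTimeSeries.csv") as myfile :
    myfile.readline()                     # skip the header row
    for line in myfile :
        fields = line.strip().split(",")  # split the row at each comma
        timestamps.append(fields[0])
        coffee_counts.append(int(fields[1]))
        contributor_names.append(fields[2])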
ALSO MASOCHISTIC : Using native Python, another method :
- Numpy's genfromtxt method reads the whole file into a variable all at once.
- The skip_header argument loses the row of headers.
- dtype = None gets Python to guess the type of variable per column.
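A minimal sketch, again assuming the CoffeeTimeSeries.csv file used later in this notebook :

import numpy as np

# Read the whole file at once; dtype=None lets Numpy guess each column's type
raw = np.genfromtxt("CoffeeTimeSeries.csv", delimiter=",", skip_header=1, dtype=None)
print raw[0]  # the first row, as a tuple of fields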
Plotting the Data
- Having done import matplotlib.pyplot as plt, you can call plt.plot(x, y) to plot a basic graph.
Datetime Objects - skip if you're just using Time_Seconds
- The datetime class : if you're using import datetime as dt, then you can call dt.datetime.
- strptime produces a datetime object from a string of text. You provide the format. For converting something like "17/03/2012 18:23", you have the format string "%d/%m/%Y %H:%M". If you give strptime the datetime string and the format string, it can understand that string and produce a datetime object.
- The strftime method asks for a formatting string like the one above. If you just give it "%H", for example, it will just give you the hour of that datetime. Using these, you can extract a list of days of the week and of hours of the day from each time entry. See http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior for a table of formatting strings.
Histograms
- plt.hist() takes a list or array, and produces a histogram.
- If you call plt.hist(mylist, mybins), then you've provided the boundaries between histogram bins, assuming mybins is a list or array. If it's a scalar, then you're telling it how many bins to use, and it calculates the boundaries itself.
Critical Error - Days without Coffee ?!
- Use diff() in Numpy to look at the difference in time between measurements. Plot these differences.
- Was the machine ever broken for a long stretch ? Use np.diff(t) to find out.
Always Acknowledge your Coworkers
- plt.pie() takes a list or array, and produces a pie chart. plt.pie(mydata, labels=mylabels) will put labels next to each slice.
- Generate a pie chart, showing who contributed to this time series.
Computing an Average Sampling Rate
- Use scipy.stats.linregress to perform a linear regression. Catch the gradient and intercept, and plot the fitted line on top of the time plot, so we can visually check the fit. Also catch the correlation coefficient ( square it for $R^2$ ), the standard error, and the p-value of the regression. You can do this for the full plot, or for any region that looks particularly nice ( remember, this is real data... ).
- gradient, intercept, r_value, pval, stderr = stats.linregress(indices, time)
Regularisation of the Sampling
- Use scipy.interpolate.interp1d(x, y) to create an interpolant. You can then pass it an array of the times that you want interpolated values at.
- Create a variable regular_coffees that is the interpolation over the regularly-spaced time vector regular_time, and plot it on top of the normal t vs coffees to ensure that the interpolation was correct.
Daily Average Over the Week
- Use plt.errorbar, and approximate 95% confidence intervals from the standard error of the mean using scipy.stats.sem.
In [4]:
# Basic numbercrunching
import numpy as np
# Interpolation and statistics
from scipy import interpolate, stats
# Plotting
import matplotlib.pyplot as plt
import seaborn
plt.rcParams["figure.figsize"] = (10, 8) # default figure size
# Datetime manipulation
# Be warned : functions in classes in modules are BADLY named...
import datetime as dt
# Data stuff I don't know how to use to its full potential, but read_csv() is nice
import pandas as pd
# Pickle, for saving variables
import pickle
In [5]:
# Let's read the data using Pandas, and look at what it contains.
data = pd.read_csv("CoffeeTimeSeries.csv")
data.head(20)
Out[5]:
In [6]:
# Let's just plot the time-series first.
# The easiest way to plot against the date variable is to go via
# a datetime object, which can later be formatted as we want it
# strptime creates a datetime object from a string :
times = [dt.datetime.strptime(i, "%d/%m/%Y %H:%M") for i in data["Timestamp"].values]
# We now have a list of datetime objects, which we can format as we like.
# Let's create a few items :
# a list of time values for plotting ( t )
# a list of strings of the day of the week ( weekday )
# a list of hours during the day ( dayhours )
# http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
# "%s" gives seconds since the Epoch ( platform-dependent, but works on Linux )
t = np.array([int(i.strftime("%s")) for i in times]) / 86400. # in days
weekday = [int(i.strftime("%w")) for i in times] # weekday as number
weekdaylabels = np.unique([i.strftime("%A") for i in times]) # string of weekday
dayhours = [int(i.strftime("%H")) for i in times]
print t
In [7]:
# OK, we've extracted some useful time data. Let's grab the number of coffees and plot stuff.
coffees = data["Coffees"].values
plt.subplot2grid((2, 2), (0, 0), colspan=2)
plt.plot(t, coffees, linewidth=3)
plt.title("Coffee Consumption Time-Series")
plt.xlabel("Time (datetime seconds)")
plt.ylabel("Number of coffees")
plt.subplot(2, 2, 3)
plt.hist(weekday, np.arange(8) - 0.5) # bins centred on the integer weekday values
plt.title("Consumption Per Day")
plt.xticks(range(7), ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]);
plt.ylabel("Frequency")
plt.subplot(2, 2, 4)
plt.hist(dayhours, np.unique(dayhours))
plt.title("Hours of Consumption During the Day")
plt.xlabel("Hour of Day")
plt.ylabel("Frequency")
plt.tight_layout()
In [8]:
# So, fairly uniform in terms of quantity consumed per day ( Monday blues ? ).
# Hourly is more interesting : it drops during the day, with an after-lunch spike.
# We can also infer a little about what time people arrive at work !
# Let's go back to the time-series itself. Was the machine broken at any point ?
# Consider over ten days a critical emergency.
plt.subplot(121)
plt.plot(np.diff(t), linewidth=3)
plt.axhline(10, color = seaborn.color_palette("deep", 3)[2], linewidth=3)
plt.title("Days between coffees")
plt.xlabel("Time (days)")
plt.ylabel("Days between coffees")
plt.subplot(122)
plt.hist(np.diff(t), np.arange(np.diff(t).max() + 1))
plt.axvline(10, color = seaborn.color_palette("deep", 3)[2], linewidth=3)
plt.title("Distribution of Number of Days Between Coffees")
plt.xlabel("Days between coffees")
plt.ylabel("Frequency")
plt.tight_layout()
In [9]:
# Yes, it's been broken ! We note five occasions :
# five measurements of over ten days between coffees.
# The last spike requires a little context. It has three values, not one,
# above the ten-day mark. This was because data had more or less stopped being collected
# ( you can see a few days between measurements after the last coffee machine breakdown
# and prior to this big spike ).
# So, if, until that point, measurements were *fairly* regular, we can ask who took them.
names = data["Name"].values
contributors = np.unique(names)
contributions = [np.sum(names == i) for i in contributors]
plt.figure(figsize=(12,12))
plt.pie(contributions, labels=contributors, colors=seaborn.color_palette("deep", len(contributors)));
In [10]:
# So mostly me, with great contributions from Mike Irvine and Sergio Morales - thanks guys !
# Let's take a step back and consider the sampling of the time series. We saw earlier how
# the time between measurements varied a great deal. Let's look at raw sampling times first.
plt.plot(t, linewidth=3)
plt.plot(np.arange(t[0], t[-1]), linewidth=3)
plt.title("Sampling Times")
plt.xlabel("Index")
plt.ylabel("Sampling time")
plt.legend(["Measurements taken", "Linear time"], loc=2)
Out[10]:
In [11]:
# In general, the blue line is under the green line, so we sampled more than once a day.
# Let's calculate the average sampling frequency by finding the gradient of the line of best fit.
# We'll use the long, middle section, between indices 250 and 600.
# Note that linregress returns the correlation coefficient r, which we square for R^2.
gradient, intercept, r_value, pval, stderr = stats.linregress(range(250, 600), t[250:600])
plt.plot(t, linewidth=3)
plt.plot(np.arange(700) * gradient + intercept, linewidth=3)
plt.title("Approximate Rate of Sampling : %f per day; $R^2$ = %f, Standard Error = %f" % (1/gradient, r2, stderr))
plt.xlabel("Index")
plt.ylabel("Time at measurement (days)")
Out[11]:
In [12]:
# The sampling is not very regular. Let's interpolate to fix that.
# We'll use a standard linear interpolator.
# First, create a list of floats, starting at the first sample, and
# ending at or just before the last sample, incrementing in one day intervals
regular_time = np.arange(t[0], t[-1])
# Now let's interpolate the series
coffee_interpolator = interpolate.interp1d(t, coffees)
regular_coffees = coffee_interpolator(regular_time)
# Plot both time series against their respective times to confirm
# and only show every tenth interpolated point
plt.plot(t, coffees)
plt.plot(regular_time[::10], regular_coffees[::10], 'o')
plt.title("Interpolation of the Time-Series")
plt.xlabel("Time (days)")
plt.ylabel("Number of coffees devoured")
Out[12]:
In [13]:
# Now that we have a uniformly-sampled time series, we can start to look at regularities during the week.
# Let's compute the average number of coffees for each day of the week. First, let's look at the series itself.
plt.plot(np.diff(regular_coffees))
Out[13]:
In [14]:
# So we see a few issues. Firstly, the flat stretches of constant value come from linearly interpolating over long gaps in the data.
# We could be clever about it, by reinterpolating after inserting a zero-change point just after long breaks,
# but in the interest of time, we'll just skip them for now. Let's start at 210, and finish at 420.
# The big dip is due to people being away. The time series started in October, which puts the dip
# squarely in the summer.
# So, let's compute that average.
# Preallocate arrays of zeros
weekly_average = np.zeros(7)
weekly_stderr = np.zeros(7)
# Iterating over each day of the week
# ( index 0 is whichever weekday regular_time[210] falls on, not necessarily Sunday )
for i in range(7) :
weekly_average[i] = np.mean(np.diff(regular_coffees[210:420])[i::7])
weekly_stderr[i] = stats.sem(np.diff(regular_coffees[210:420])[i::7])
# Plot with approximate 95% CI
plt.errorbar(range(7), weekly_average, yerr=weekly_stderr * 1.96, linewidth=3)
Out[14]:
In [15]:
# If you got this far, well done.
# You probably work as hard as Warwick Complexity, whose caffeine consumption doesn't drop to zero over the weekends...