In [1]:
import pandas as pd
Note: this cell should run without raising a traceback. Assuming it runs, you can also try printing the value of pd.__version__ to see which version of pandas you have installed.
We'll be analysing a real-world dataset together. It's about my favourite thing in the world: coffee. This dataset was collected at the Mathematics Institute at the University of Warwick. It's a time-series dataset, describing the total number of coffees made by our espresso machine by a certain date.
Slides are available at https://github.com/QCaudron/pydata_pandas. One notebook contains solutions; beware of spoilers.
The notebooks contain notes on what we're doing; I'll skip them during the workshop, but they should help explain things along the way.
The pandas API is enormous. The documentation is excellent; don't hesitate to look things up.
In [2]:
import pandas as pd
%matplotlib inline
Note: the second line here tells matplotlib to plot directly under the cell in which any plotting code is called. pandas uses matplotlib to generate graphs, and without this, the graphs would appear outside the Jupyter notebook when you called plt.show() - but we just want them to appear without having to do that.
http://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1
Let's import the coffee data from CSV.
In [3]:
# Read data from data/coffees.csv
data = pd.read_csv("data/coffees.csv")
Note: pandas can read from many data formats: CSV, JSON, Excel, HDF5, SQL, and more.
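For instance, a sketch of a couple of the other readers - these particular files are hypothetical and not part of this workshop's data, so the lines are commented out:
# pd.read_json("data/coffees.json")
# pd.read_excel("data/coffees.xlsx")  # needs an Excel engine such as openpyxl installed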
In [4]:
data
Out[4]:
In [5]:
# .head()
data.head()
Out[5]:
We have an index, and three columns: timestamp, coffees, and contributor.
Uh-oh. Why is there a string of text, testing, in our coffee numbers? What's going on in the coffees column in the row after that?
Note: df.head(n=10) would show the first ten rows. The default is n=5.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
In [6]:
# .loc or .iloc
data.loc[2]
Out[6]:
Definitely a string. We'll note this as something to fix after we finish looking around.
Note: .loc uses a label-based lookup, which means that the value you pass into the square brackets must be in the index. Another method, .iloc, is integer-location-based, so .iloc[2] would return the third row. In this case they're the same, but had we changed our index, as we'll see later, things would work differently.
Indexing a dataframe with [] directly returns a pd.Series or pd.DataFrame by searching over columns, not rows. Indexing a pd.Series with [] is like indexing a dataframe with .iloc.
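To make the distinction concrete, here's a minimal sketch using a throwaway frame (not our coffee data) whose index is a set of string labels:
toy = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])
toy.loc["c"]   # label-based: looks up the index label "c"
toy.iloc[2]    # position-based: the third row, whatever its label
toy["x"][:2]   # [] selects the column "x" as a series, which then slices by position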
In [7]:
# [] indexing on a series
data.coffees[:5]
Out[7]:
Note: here, we're indexing a series (a pd.Series object). When you access a single column of a pd.DataFrame (here, data.coffees or data["coffees"]), the object returned is a pd.Series. From that, indexing directly with [] works in an integer-location-based manner, and as with numpy arrays, you can take slices ([:5]).
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
In [8]:
print("Dataset length :")
# len()
print(len(data))
In [9]:
# .describe()
data.describe()
Out[9]:
Looks like we also have some missing data - we have 671 rows, but the coffees column only has 658 entries.
Note: .describe() returns different things based on what's in the dataframe, as we'll see later. For numerical columns, it will return things like the mean, standard deviation, and percentiles. For object columns (strings or datetimes), it will return the most frequent entry, and the first and last items. For all columns, .describe() will return the count of objects in that column (not counting NaNs) and the number of unique entries. You can control what's returned using .describe()'s keyword arguments.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
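As a sketch of those keyword arguments (the percentile choices here are arbitrary):
data.describe(percentiles=[0.1, 0.5, 0.9])  # choose which percentiles are reported
data.describe(include=["object"])           # describe only the object columns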
In [10]:
# .isnull() and boolean indexing with []
data[data.coffees.isnull()]
Out[10]:
Note: .isnull() returns a boolean array (an array of Trues and Falses) that you can then use to index the dataframe directly. Here, our boolean array tells us which entries in the coffees column are null, and we use it to index against the full dataframe - so we get back every column of the dataframe, but only those rows where coffees is null.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html
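Two closely related idioms, as a quick sketch:
data.isnull().sum()           # count the null entries in each column
data[data.coffees.notnull()]  # the complement: only rows where coffees is present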
In [11]:
# .dtypes
data.dtypes
Out[11]:
The contributor column makes sense as object, because we expect strings there; but surely timestamp should be a timestamp type, and coffees should be numerical?
In [12]:
# print the first element of the series with [] indexing
print(data.timestamp[0])
# print its type()
print(type(data.timestamp[0]))
It looks like the timestamp field was read from CSV as a string. That makes sense - CSV files are very basic. We'll have pandas interpret these strings as datetimes for us automatically.
Note: here's an example of using direct [] indexing on a pd.Series. We're accessing the first entry, just to see what type of object we have there.
In [13]:
# cast the coffees column using pd.to_numeric, and coerce errors
data.coffees = pd.to_numeric(data.coffees, errors="coerce")
data.head()
Out[13]:
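The errors keyword decides what happens to unparseable values; a small sketch with a throwaway series:
s = pd.Series(["1", "2", "testing"])
pd.to_numeric(s, errors="coerce")   # "testing" becomes NaN
# pd.to_numeric(s, errors="raise")  # the default: raises a ValueError instead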
In [14]:
# Use .dropna() using a subset, and pass inplace
data.dropna(subset=["coffees"], inplace=True)
data.head()
Out[14]:
In [15]:
# Cast to int using .astype()
data.coffees = data.coffees.astype(int)
data.head()
Out[15]:
In [16]:
# pd.to_datetime()
data.timestamp = pd.to_datetime(data.timestamp)
# Confirm dtypes
data.dtypes
Out[16]:
In [17]:
# .describe(), passing the include kwarg to see all information
data.describe(include="all")
Out[17]:
In [18]:
# What do the first few rows look like?
data.head()
Out[18]:
Note : .describe(include="all")
is describing all attributes of all columns, but some don't make sense based on the column's dtype
. For example, the contributor column has no first
and last
attributes, because those describe the first and last entries in an ordered series. That makes sense for the timestamp - those have an intuitive definition of sorting - but not so much for strings ( alphabetical order doesn't really matter when they're arbitrary strings ). Similary, the timestamp column has no mean or other numerical traits. What does it mean to calculate the mean timestamp ?
In [19]:
# .plot() on the coffees series
data.coffees.plot()
Out[19]:
pandas is plotting the coffees against the index, which is just a series of integers.
Note: .plot() on a pd.Series will plot the data against the index. On a pd.DataFrame, the .plot() method allows plotting of one column against another.
By default, .plot() renders a line graph, but you can specify which type of plot you'd like - bar, line, histogram, area, scatter, and so on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
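For instance (these particular plot kinds are arbitrary choices, just to show the kwarg):
data.coffees.plot(kind="hist")                      # histogram of the coffee counts
data.plot(x="timestamp", y="coffees", kind="area")  # area plot of one column against another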
In [20]:
# .plot() on the dataframe, setting x to the timestamp column, with point-and-line style
data.plot(x="timestamp", style=".-")
Out[20]:
We have some very uneven spacing in places. We might start by cutting off the last few points of this time-series, where a lot of data is missing.
In [21]:
# .tail() with ten rows
data.tail(n=10)
Out[21]:
After mid-March, things start getting spaced rather erratically.
In [22]:
# Use conditional indexing against the timestamp
data = data[data.timestamp < "2013-03-01"]
data.tail()
Out[22]:
Note: this is another example of boolean indexing. data.timestamp < "2013-03-01" is a boolean array, and can be passed into the dataframe directly in [], much as with a np.ndarray.
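Boolean arrays can also be combined with & and |, wrapping each condition in parentheses; a sketch with arbitrary date bounds:
mask = (data.timestamp > "2012-06-01") & (data.timestamp < "2013-03-01")
data[mask].head()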
In [23]:
# Once again, plot the data against the timestamp
data.plot(x="timestamp", style=".-")
Out[23]:
In [24]:
# .value_counts()
data.contributor.value_counts()
Out[24]:
Note: .value_counts() counts the unique values in a series. It's similar to doing a .groupby() followed by a .count(), as we'll see soon.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
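To make the equivalence concrete, a sketch of the groupby spelling:
data.groupby("contributor").size()  # the same counts, sorted by label rather than by frequency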
In [25]:
# .plot() a bar chart from the value counts
data.contributor.value_counts().plot(kind="bar")
Out[25]:
In [26]:
# Create a series of the weekdays for each entry using .dt.weekday
weekdays = data.timestamp.dt.weekday
# assign() it to our dataframe
data = data.assign(weekdays=weekdays)
data.head()
Out[26]:
In [27]:
weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_dict = {key: weekday_names[key] for key in range(7)}
# Use .apply() to apply a custom function to the weekdays column
data.weekdays = data.weekdays.apply(lambda x: weekday_dict[x])
data.head()
Out[27]:
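An aside: for a plain dictionary lookup like this, .map() is the more direct (and usually faster) spelling. A sketch, equivalent to the .apply() call above - to be run instead of it, not after it, since the column now holds names:
# data.weekdays = data.weekdays.map(weekday_dict)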
In [28]:
# .groupby() the weekdays and then .count() rows in each group
weekday_counts = data.groupby("weekdays").count()
# We can reorder this dataframe by our weekday_names list
weekday_counts = weekday_counts.loc[weekday_names]
weekday_counts
Out[28]:
Note: the first line could be replaced by weekday_counts = data.weekdays.value_counts(), the only difference being that it would return a series, whereas here we got back a dataframe.
In [29]:
# .plot() a bar chart of data in weekday_counts
weekday_counts.timestamp.plot(kind="bar", title="Datapoints added on each weekday")
Out[29]:
In [30]:
# Set the dataframe's .index property
data.index = data.timestamp
# Let's drop the timestamp column, as we no longer need it
data.drop(["timestamp"], axis=1, inplace=True)
data.head()
Out[30]:
In [31]:
# pd.date_range, with daily frequency, and normalisation
midnights = pd.date_range(data.index[0], data.index[-1], freq="D", normalize=True)
midnights
Out[31]:
Note: pd.date_range creates a fixed-frequency DatetimeIndex. normalize=True ensures these datetimes are at midnight, rather than at whatever time the starting point is.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
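A couple more sketches of the same function, with arbitrary dates:
pd.date_range("2018-01-01", periods=3, freq="D")     # three consecutive days
pd.date_range("2018-01-01", "2018-01-31", freq="W")  # the Sundays within January 2018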
In [32]:
# Take the union of the existing and new indices
new_index = midnights.union(data.index)
new_index
Out[32]:
Note: the union of these indices is just a new index where entries from both indices are present. It's sorted by time.
In [33]:
# .reindex() the dataframe
upsampled_data = data.reindex(new_index)
upsampled_data.head(10)
Out[33]:
Note: .reindex() keeps any values that conform to the new index, and inserts NaNs where we have no values.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
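.reindex() can also fill the holes instead of inserting NaNs; one option, as a sketch (we don't use it here, because interpolation suits this data better):
data.reindex(new_index, method="ffill")  # propagate the last valid observation forward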
In [34]:
# .interpolate the upsampled_data using the time method
upsampled_data = upsampled_data.interpolate(method="time")
upsampled_data.head(10)
Out[34]:
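Note: method="time" weights the interpolation by the actual time elapsed between observations, which matters here because our index is irregularly spaced; the default, method="linear", treats the points as evenly spaced instead. A sketch of that alternative, commented out so as not to overwrite our result:
# upsampled_data.interpolate(method="linear")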
In [35]:
# .resample() followed by .asfreq()
daily_data = upsampled_data.resample("D").asfreq()
# Drop the contributor column, we no longer need it
daily_data = daily_data.drop(["contributor"], axis=1)
# Generate a column of weekday names
daily_data["weekdays"] = daily_data.index.weekday_name # We did it the slow way before...
daily_data.head()
Out[35]:
In [36]:
# Let's plot the data once more, to see how we're doing
daily_data.plot(figsize=(15, 4), style=".")
Out[36]:
In [37]:
# Use .diff() on the coffees column; follow up with .shift()
coffees_made = daily_data.coffees.diff().shift(-1)
# Add this as a column to the dataframe
daily_data["coffees_made_today"] = coffees_made
daily_data.head()
Out[37]:
Note: we use .shift() here because if we look at the .diff() between a Monday and a Tuesday, those coffees are attributed to the Tuesday. However, what we want to say is "this many coffees were made at some point on the Monday", so we shift the entire series up by one.
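A toy example of the pattern, with made-up numbers:
s = pd.Series([0, 3, 5], index=pd.to_datetime(["2013-01-07", "2013-01-08", "2013-01-09"]))
s.diff()            # NaN, 3, 2 : each day's increase, attributed to the later day
s.diff().shift(-1)  # 3, 2, NaN : the same increases, attributed to the day they happened on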
In [38]:
# .groupby weekdays, take the mean, and grab the coffees_made_today column
coffees_by_day = daily_data.groupby("weekdays").mean().coffees_made_today
coffees_by_day
Out[38]:
In [39]:
# Sort coffees_by_day by our list of weekday names
coffees_by_day = coffees_by_day[weekday_names]
# Plot a bar chart
coffees_by_day.plot(kind="bar")
Out[39]:
Wednesdays were seminar days...
In [40]:
# Bring in data/department_members.csv;
# have the first column be the index, and parse the dates
people = pd.read_csv("data/department_members.csv", index_col=[0], parse_dates=True)
people.head()
Out[40]:
In [41]:
# Use an outer join, then interpolate over missing values using nearest values
daily_data = daily_data.join(people, how="outer").interpolate(method="nearest")
daily_data.head()
Out[41]:
Note: .join() performs a left join by default - every row of our dataframe is kept, rows of the other dataframe whose index doesn't match are dropped, and any holes are filled with NaNs. Here, we asked for an outer join, which instead keeps the union of both indices. You can also specify inner or right joins, as well as plenty of other useful options; the pandas API for joining or merging datasets is very well developed.
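A toy example of the difference, with made-up frames:
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4]}, index=["y", "z"])
left.join(right)               # left join by default: keeps "x" and "y"
left.join(right, how="outer")  # keeps "x", "y", and "z", with NaNs where data is missing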
In [42]:
# New column is the ratio of coffees made on a given day to number of members in the department
daily_data["coffees_per_person"] = daily_data.coffees_made_today / daily_data.members
# Let's drop those remaining NaNs while we're at it
daily_data.dropna(inplace=True)
daily_data.head()
Out[42]:
In [43]:
# Plot the coffees_per_person column
daily_data.coffees_per_person.plot()
Out[43]:
In [44]:
# pd.read_csv(); try using data/coffee_status.csv
# parse_dates as kwarg; also pass index_col
machine_status = pd.read_csv("data/coffee_status.csv", parse_dates=["date"], index_col="date")
machine_status.head()
Out[44]:
Note: the parse_dates keyword argument accepts several kinds of values. By passing in a list of strings, we're telling pandas to attempt to parse the dates in the columns with those names.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
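As a sketch of another accepted form - equivalent here, since date is our index column:
pd.read_csv("data/coffee_status.csv", index_col="date", parse_dates=True)  # True parses the index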
In [45]:
# .value_counts()
machine_status.status.value_counts()
Out[45]:
In [46]:
# Make a pd.Series from the status series where things are OK
numerical_status = machine_status.status == "OK"
numerical_status.plot()
Out[46]:
Note: the first line here creates a boolean pd.Series, holding the value True when machine_status.status is "OK", and False otherwise. Because it's a pd.Series, its index stays the same as that of machine_status, which was a DatetimeIndex. We can then plot the boolean series (True appearing as 1, and False as 0), and quickly scan to see that there are long stretches where the coffee machine was operational, with short bouts (thankfully!) of the machine being broken.
In [47]:
# .join()
daily_data = daily_data.join(machine_status)
daily_data.head()
Out[47]:
In [48]:
# Column depicting when the status was "OK"
# Cast the series to ints as you create the new column in the dataframe
daily_data["numerical_status"] = (daily_data.status == "OK").astype(int)
daily_data.head()
Out[48]:
In [49]:
# Plot both columns on the same graph, using default args
daily_data[["coffees_per_person", "numerical_status"]].plot()
Out[49]:
In [50]:
# Resample weekly, taking the mean of each week to get a weekly value
# (numeric_only skips the string columns, which have no mean)
weekly_data = daily_data.resample("W").mean(numeric_only=True)
weekly_data[["coffees_per_person", "numerical_status"]].plot()
Out[50]:
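The aggregation applied after .resample() is up to you; a couple of alternative sketches:
daily_data.resample("W").max(numeric_only=True)    # weekly maxima instead of weekly means
daily_data.resample("2W").mean(numeric_only=True)  # fortnightly means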