Unsupervised Analysis of Days of Week

Treating the hourly crossings on each day as features to learn about the relationships between various days.
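To make "each day as features" concrete, here is a minimal sketch on synthetic data (the `toy` frame and its random counts are assumptions for illustration, not the Fremont data). It shows how hourly counts can be pivoted so that every day becomes one row with 24 hourly features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for hourly crossing counts (NOT the real Fremont data).
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=24 * 7, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Total": rng.integers(0, 100, size=len(index))}, index=index)

# Rows: time of day, columns: calendar date (same layout as the notebook's pivot).
toy_pivoted = toy.pivot_table("Total", index=toy.index.time, columns=toy.index.date)

# Transpose so each row is one day and each column is one hour of that day.
X_toy = toy_pivoted.fillna(0).T.values
print(X_toy.shape)
```

With seven full days of hourly data, the resulting matrix has one row per day and 24 columns, which is the shape that PCA and clustering operate on.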


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'

import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

Get Data


In [2]:
from jupyterworkflow.data import get_fremont_data
data = get_fremont_data()

In [3]:
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x2764fa68518>

Principal Components Analysis


In [4]:
# One row per day, one column per hour of the day
X = pivoted.fillna(0).T.values
X.shape


Out[4]:
(1610, 24)

In [5]:
X2 = PCA(2, svd_solver='full').fit_transform(X)
X2.shape


Out[5]:
(1610, 2)

In [6]:
plt.scatter(X2[:, 0], X2[:, 1])


Out[6]:
<matplotlib.collections.PathCollection at 0x276525c46a0>

Unsupervised Clustering


In [7]:
gmm = GaussianMixture(2).fit(X)
labels = gmm.predict(X)
np.unique(labels)


Out[7]:
array([0, 1], dtype=int64)

In [8]:
plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap='rainbow')
plt.colorbar()


Out[8]:
<matplotlib.colorbar.Colorbar at 0x2765275a2e8>

In [9]:
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

pivoted.T[labels == 0].T.plot(legend=False, alpha=0.1, ax=ax[0])
pivoted.T[labels == 1].T.plot(legend=False, alpha=0.1, ax=ax[1])

ax[0].set_title('Purple Cluster')
ax[1].set_title('Red Cluster')


Out[9]:
<matplotlib.text.Text at 0x276526fa978>

Comparing with Day of Week


In [10]:
pd.DatetimeIndex(pivoted.columns).dayofweek


Out[10]:
array([2, 3, 4, ..., 6, 0, 1])

In [11]:
dayofweek = pd.DatetimeIndex(pivoted.columns).dayofweek

In [12]:
plt.scatter(X2[:, 0], X2[:, 1], c=dayofweek, cmap='rainbow')
plt.colorbar()


Out[12]:
<matplotlib.colorbar.Colorbar at 0x2765693ecf8>

The colors are the pandas dayofweek codes (Monday = 0 through Sunday = 6):
  • 0-4: weekdays
  • 5, 6: weekend
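
As a small sketch of how these codes can be compared against cluster labels, the example below builds one week of dates, maps the codes to weekday/weekend, and cross-tabulates them with labels. The `toy_labels` array is hypothetical and hand-written for illustration; in the notebook the labels come from the fitted GaussianMixture.

```python
import numpy as np
import pandas as pd

# One week starting on a Monday (2012-10-01).
week = pd.date_range("2012-10-01", periods=7)
codes = np.asarray(week.dayofweek)  # Monday = 0 ... Sunday = 6

# Map the numeric codes to a readable day type.
day_type = np.where(codes >= 5, "weekend", "weekday")

# Hypothetical cluster labels, one per day (illustration only).
toy_labels = np.array([0, 0, 0, 0, 0, 1, 1])

# Count how many days of each type fall into each cluster.
table = pd.crosstab(toy_labels, day_type, rownames=["cluster"], colnames=["day type"])
print(table)
```

A table like this gives a quick numeric check of how well the clusters line up with the weekday/weekend split, alongside the scatter plot.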

Analyzing Outliers

The following dates are weekdays that the clustering assigned to the weekend-like cluster (label 1), i.e. weekdays with a holiday-like pattern:


In [13]:
dates = pd.DatetimeIndex(pivoted.columns)
# Weekdays (Monday-Friday) that landed in cluster 1
dates[(labels == 1) & (dayofweek < 5)]


Out[13]:
DatetimeIndex(['2012-11-22', '2012-11-23', '2012-12-24', '2012-12-25',
               '2013-01-01', '2013-05-27', '2013-07-04', '2013-07-05',
               '2013-09-02', '2013-11-28', '2013-11-29', '2013-12-20',
               '2013-12-24', '2013-12-25', '2014-01-01', '2014-04-23',
               '2014-05-26', '2014-07-04', '2014-09-01', '2014-11-27',
               '2014-11-28', '2014-12-24', '2014-12-25', '2014-12-26',
               '2015-01-01', '2015-05-25', '2015-07-03', '2015-09-07',
               '2015-11-26', '2015-11-27', '2015-12-24', '2015-12-25',
               '2016-01-01', '2016-05-30', '2016-07-04', '2016-09-05',
               '2016-11-24', '2016-11-25', '2016-12-26', '2017-01-02',
               '2017-02-06'],
              dtype='datetime64[ns]', freq=None)
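
One way to check the holiday interpretation is to compare these dates against a holiday calendar. The sketch below uses pandas' built-in `USFederalHolidayCalendar` on a handful of dates copied from the output above; choosing the US federal calendar is an assumption, and days adjacent to a holiday are not part of that calendar.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# A few of the outlier dates from the output above.
outliers = pd.DatetimeIndex([
    "2012-11-22", "2012-11-23", "2013-07-04",
    "2013-07-05", "2014-04-23", "2017-02-06",
])

# US federal holidays over the period covered by the data (assumed calendar).
holidays = USFederalHolidayCalendar().holidays(start="2012-01-01", end="2017-12-31")

# Flag which outliers fall exactly on a federal holiday.
is_holiday = outliers.isin(holidays)
print(pd.Series(is_holiday, index=outliers))

# Dates that are not explained by the calendar and may need a closer look.
unexplained = outliers[~is_holiday]
print(unexplained)
```

Dates left in `unexplained` are either days next to a holiday or candidates for some other cause.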
