NYC Energy Data Meetup : #LiveAnalytic

hello spectators & judges!

pull the data into pandas DataFrames


In [1]:
import pandas as pd
%pylab inline
df1 = pd.read_csv('./data/20140623/NYCEnergyData_dataset_062314_0.csv')
df2 = pd.read_csv('./data/20140623/NYCEnergyData_dataset_062314_NA.csv')


Populating the interactive namespace from numpy and matplotlib

take a quick peek at the data


In [2]:
df1.head()


Out[2]:
ROW.ID RTD.End.Time.Stamp HDM DOW Year Month Day Hour Load.kW TEMP.F ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
0 1 1/1/2011 0:05 0_1_1 Sat 2011 1 1 1 5267.6 35.1 ... 1 0 0 0 39 35.1 0 0 0 28.0
1 2 1/1/2011 1:00 1_1_1 Sat 2011 1 1 2 5132.2 33.1 ... 1 0 0 0 0 0.0 0 0 0 27.0
2 3 1/1/2011 2:00 2_1_1 Sat 2011 1 1 3 4905.6 34.0 ... 1 0 0 0 0 0.0 0 0 0 28.0
3 4 1/1/2011 3:00 3_1_1 Sat 2011 1 1 4 4716.8 35.1 ... 1 0 0 0 0 0.0 0 0 0 28.9
4 5 1/1/2011 4:00 4_1_1 Sat 2011 1 1 5 4594.8 35.1 ... 2506 0 0 0 0 0.0 0 0 0 28.0

5 rows × 36 columns


In [3]:
df2.head()


Out[3]:
ROW.ID RTD.End.Time.Stamp HDM DOW Year Month Day Hour Load.kW TEMP(F) ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
0 1 1/1/2011 0:05 0_1_1 Sat 2011 1 1 1 5267.6 35.1 ... 1 0 0 0 39 35.1 0 0 0 28.0
1 2 1/1/2011 1:00 1_1_1 Sat 2011 1 1 2 5132.2 33.1 ... 1 0 0 0 0 0.0 0 0 0 27.0
2 3 1/1/2011 2:00 2_1_1 Sat 2011 1 1 3 4905.6 34.0 ... 1 0 0 0 0 0.0 0 0 0 28.0
3 4 1/1/2011 3:00 3_1_1 Sat 2011 1 1 4 4716.8 35.1 ... 1 0 0 0 0 0.0 0 0 0 28.9
4 5 1/1/2011 4:00 4_1_1 Sat 2011 1 1 5 4594.8 35.1 ... 2506 0 0 0 0 0.0 0 0 0 28.0

5 rows × 36 columns

ah, Lucas indicates that _NA is full of NaN, and _0 has been "cleaned"; let's stick with the raw data and let pandas help us out

use DataFrame.set_index to make this officially a timeseries :)


In [4]:
df2.set_index(df2['RTD.End.Time.Stamp'].map(lambda x: pd.Timestamp(x)), inplace=True)
df2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26301 entries, 2011-01-01 00:05:00 to 2013-12-31 23:00:00
Data columns (total 36 columns):
ROW.ID                26301 non-null int64
RTD.End.Time.Stamp    26301 non-null object
HDM                   26301 non-null object
DOW                   26301 non-null object
Year                  26301 non-null int64
Month                 26301 non-null int64
Day                   26301 non-null int64
Hour                  26301 non-null int64
Load.kW               26277 non-null float64
TEMP(F)               26080 non-null float64
RELH..                26081 non-null float64
SKNT.mph              26078 non-null float64
GUST.mph              4770 non-null float64
DRCT..                24697 non-null float64
QFLG                  26076 non-null object
PMSL.in               23299 non-null float64
ALTI.in               26106 non-null float64
P03D.in               7944 non-null float64
PCHA                  279 non-null float64
WNUM                  26145 non-null object
VSBY.miles            26104 non-null float64
P01I.in               3895 non-null float64
P03I.in               894 non-null float64
P06I.in               1105 non-null float64
P24I.in               693 non-null float64
SNOW.in               536 non-null float64
CHC1                  25861 non-null float64
CHC2                  15533 non-null float64
CHC3                  7501 non-null float64
CIG.feet              15544 non-null float64
HI6...F               4188 non-null float64
LO6...F               4188 non-null float64
PEAK.mph              1380 non-null float64
HI24...F              1378 non-null float64
LO24...F              1378 non-null float64
DWP..F                26080 non-null float64
dtypes: float64(26), int64(5), object(5)
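
a side note while the info() is fresh: read_csv can parse the timestamps itself via parse_dates/index_col, skipping the map/lambda entirely. a minimal sketch on made-up rows with the same column names:

```python
import io

import pandas as pd

# a tiny stand-in for the meetup CSV (hypothetical rows, same column names)
csv = io.StringIO(
    "ROW.ID,RTD.End.Time.Stamp,Load.kW\n"
    "1,1/1/2011 0:05,5267.6\n"
    "2,1/1/2011 1:00,5132.2\n"
)

# parse the stamp column as dates and make it the index in one call
df = pd.read_csv(csv, parse_dates=['RTD.End.Time.Stamp'],
                 index_col='RTD.End.Time.Stamp')
print(type(df.index).__name__)  # DatetimeIndex
```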

how granular are these observations?


In [5]:
df2['Load.kW'].head(10)


Out[5]:
RTD.End.Time.Stamp
2011-01-01 00:05:00    5267.6
2011-01-01 01:00:00    5132.2
2011-01-01 02:00:00    4905.6
2011-01-01 03:00:00    4716.8
2011-01-01 04:00:00    4594.8
2011-01-01 05:00:00    4577.3
2011-01-01 06:00:00    4715.9
2011-01-01 07:00:00    4699.9
2011-01-01 08:00:00    4844.1
2011-01-01 09:00:00    5021.7
Name: Load.kW, dtype: float64

ok, roughly hourly. and as the info() output above shows, there are a few non-numeric (object) columns; what are these?


In [6]:
df2.QFLG.value_counts()


Out[6]:
OK         26065
Caution       11
dtype: int64

ok, Lucas says QFLG is a "data quality check", how about WNUM?


In [7]:
df2.WNUM.value_counts()


Out[7]:
mostly cloudy      7963
partly cloudy      5191
overcast           3862
mostly clear       3191
clear              1902
fog                1232
lt rain             970
lt rain; fog        505
lt drizzle; fog     207
mod rain; fog       194
lt snow             182
haze                163
lt snow; fog        115
hvy rain; fog        88
mod rain             48
...
; mod ice pellet; fog             1
squalls                           1
mod thunder shwr; squalls         1
; mod snow pellet; mod rain       1
; mod rain; fog                   1
; blowing snow; ice fog           1
thunder; squalls                  1
lt ice pellets                    1
lt snow; lt ice pellets           1
; fog; blowing snow               1
lt snow; mod ice pellet           1
lt frz rain; mod ice pellet       1
; lt snow; lt ice pellets         1
lt rain; mod snow                 1
lt frz drizzle; lt ice pellets    1
Length: 65, dtype: int64

looks fun, maybe we'll come back to this
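
(if we do come back to it, those semicolon-separated conditions pull apart nicely with vectorized string methods. a sketch on a few made-up WNUM-style values:)

```python
import pandas as pd

# a few made-up WNUM-style strings (semicolon-separated weather conditions)
wnum = pd.Series(['lt rain; fog', 'mostly cloudy', 'lt snow; fog', 'clear'])

# one boolean indicator column per condition, via split + get_dummies
flags = wnum.str.get_dummies(sep='; ')
print(flags['fog'].tolist())  # [1, 0, 1, 0]
```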

rename some columns; this will mean less typing when indexing/plotting these


In [8]:
df2.rename(columns={'Load.kW':'Load', 'TEMP(F)':'Temp' }, inplace=True)

Histogram of Load values

here i am using an awesome python visualization library built atop matplotlib called seaborn

http://web.stanford.edu/~mwaskom/software/seaborn/


In [9]:
import seaborn as sns
df2.Load.hist(figsize=(15,6), bins=100)


Out[9]:
<matplotlib.axes.AxesSubplot at 0x10bc96f50>

Histogram of Temperature values


In [10]:
df2.Temp.hist(figsize=(15,6), bins=100)


Out[10]:
<matplotlib.axes.AxesSubplot at 0x107736650>

Timeseries plot of Temperature


In [11]:
df2.Temp.plot(figsize=(15,6))


Out[11]:
<matplotlib.axes.AxesSubplot at 0x10bd0fed0>

so we see some variance, and a few crazy outliers, e.g. did it really get below 20F in early May 2013?

Timeseries plot of Load


In [12]:
df2.Load.plot(figsize=(15,6))


Out[12]:
<matplotlib.axes.AxesSubplot at 0x10c23d350>

we see a big drop in the fall of 2012... this is likely Superstorm Sandy

Let's look at the predictive value of our weather data vs Load

what were those columns again??


In [13]:
df2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26301 entries, 2011-01-01 00:05:00 to 2013-12-31 23:00:00
Data columns (total 36 columns):
ROW.ID                26301 non-null int64
RTD.End.Time.Stamp    26301 non-null object
HDM                   26301 non-null object
DOW                   26301 non-null object
Year                  26301 non-null int64
Month                 26301 non-null int64
Day                   26301 non-null int64
Hour                  26301 non-null int64
Load                  26277 non-null float64
Temp                  26080 non-null float64
RELH..                26081 non-null float64
SKNT.mph              26078 non-null float64
GUST.mph              4770 non-null float64
DRCT..                24697 non-null float64
QFLG                  26076 non-null object
PMSL.in               23299 non-null float64
ALTI.in               26106 non-null float64
P03D.in               7944 non-null float64
PCHA                  279 non-null float64
WNUM                  26145 non-null object
VSBY.miles            26104 non-null float64
P01I.in               3895 non-null float64
P03I.in               894 non-null float64
P06I.in               1105 non-null float64
P24I.in               693 non-null float64
SNOW.in               536 non-null float64
CHC1                  25861 non-null float64
CHC2                  15533 non-null float64
CHC3                  7501 non-null float64
CIG.feet              15544 non-null float64
HI6...F               4188 non-null float64
LO6...F               4188 non-null float64
PEAK.mph              1380 non-null float64
HI24...F              1378 non-null float64
LO24...F              1378 non-null float64
DWP..F                26080 non-null float64
dtypes: float64(26), int64(5), object(5)

let's take a small slice of this data to make our tests fast


In [14]:
df_slice = df2[pd.Timestamp('20130101'):pd.Timestamp('20130108')]
df_slice.Load.plot()


Out[14]:
<matplotlib.axes.AxesSubplot at 0x10c6483d0>

OK, good we have just over a week of data, and the weekends are clearly lower-load
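
(why "just over" a week: label-based slicing on a DatetimeIndex includes both endpoints, so Jan 1 through Jan 8 is 8 days. a quick sketch on toy daily data:)

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', '2013-01-31', freq='D')
df = pd.DataFrame({'Load': np.arange(len(idx), dtype=float)}, index=idx)

# label-based slicing on a DatetimeIndex includes BOTH endpoints
week = df[pd.Timestamp('20130101'):pd.Timestamp('20130108')]
print(len(week))  # 8

# partial-string indexing is an even shorter spelling of the same slice
print(len(df.loc['2013-01-01':'2013-01-08']))  # 8
```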

let's grab all of the numeric columns. we want everything after Hour (i.e. Load onward), except QFLG and WNUM. i'm going to just do this manually.


In [15]:
cols = df_slice.columns[8:]
cols


Out[15]:
Index([u'Load', u'Temp', u'RELH..', u'SKNT.mph', u'GUST.mph', u'DRCT..', u'QFLG', u'PMSL.in', u'ALTI.in', u'P03D.in', u'PCHA', u'WNUM', u'VSBY.miles', u'P01I.in', u'P03I.in', u'P06I.in', u'P24I.in', u'SNOW.in', u'CHC1', u'CHC2', u'CHC3', u'CIG.feet', u'HI6...F', u'LO6...F', u'PEAK.mph', u'HI24...F', u'LO24...F', u'DWP..F'], dtype='object')

In [16]:
cols = [u'Load', u'Temp', u'RELH..', u'SKNT.mph', u'GUST.mph', u'DRCT..', u'PMSL.in', u'ALTI.in', u'P03D.in', u'PCHA', u'VSBY.miles', u'P01I.in', u'P03I.in', u'P06I.in', u'P24I.in', u'SNOW.in', u'CHC1', u'CHC2', u'CHC3', u'CIG.feet', u'HI6...F', u'LO6...F', u'PEAK.mph', u'HI24...F', u'LO24...F', u'DWP..F']

In [17]:
df_clean_slice = df_slice.query("QFLG == 'OK'")[cols]
df_clean_slice.head()


Out[17]:
Load Temp RELH.. SKNT.mph GUST.mph DRCT.. PMSL.in ALTI.in P03D.in PCHA ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
RTD.End.Time.Stamp
2013-01-01 00:05:00 5214.3 39.9 57 12.7 NaN 250 29.89 29.89 236.48 NaN ... 654 NaN NaN 6500 39.9 37.9 NaN NaN NaN 26.1
2013-01-01 01:00:00 5045.7 39.0 59 12.7 NaN 260 29.90 29.90 NaN NaN ... 604 NaN NaN 6000 NaN NaN NaN NaN NaN 26.1
2013-01-01 02:00:00 4831.3 39.0 59 11.5 NaN 270 29.90 29.90 NaN NaN ... 463 803 2504 4600 NaN NaN NaN NaN NaN 26.1
2013-01-01 03:00:00 4640.5 39.9 59 15.0 NaN 260 29.91 29.91 29.68 NaN ... 464 NaN NaN 4600 NaN NaN NaN NaN NaN 27.0
2013-01-01 04:00:00 4501.7 39.9 62 17.3 NaN 250 29.90 29.90 NaN NaN ... 463 554 NaN 4600 NaN NaN NaN NaN NaN 28.0

5 rows × 26 columns

Correlation matrix

back to the whole dataset, let's try this QFLG column


In [18]:
df_clean = df2.query("QFLG == 'OK'").copy()
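
(.query is just a readable spelling of a boolean mask; the two selections below are identical. toy data:)

```python
import pandas as pd

df = pd.DataFrame({'QFLG': ['OK', 'Caution', 'OK'], 'Load': [5.0, 6.0, 7.0]})

# .query() compiles the expression string down to the same boolean-mask selection
a = df.query("QFLG == 'OK'")
b = df[df['QFLG'] == 'OK']
print(a.equals(b))  # True
```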

generate the correlation matrix plot using seaborn's corrplot

http://web.stanford.edu/~mwaskom/software/seaborn/examples/many_pairwise_correlations.html


In [19]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")

f, ax = plt.subplots(figsize=(9, 9))
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"], as_cmap=True)
sns.corrplot(df_clean[cols], annot=False, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()
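
(under the hood that plot is mostly just pandas: `df.corr()` computes the pairwise Pearson correlations, excluding NaNs pairwise. note for future readers: `corrplot` was later removed from seaborn, and `sns.heatmap(df.corr())` is the modern route. a sketch on synthetic columns, not the meetup data:)

```python
import numpy as np
import pandas as pd

# synthetic frame: one strong relationship, one pure-noise column
rng = np.random.RandomState(0)
df = pd.DataFrame({'Load': rng.randn(200)})
df['Temp'] = 0.9 * df['Load'] + 0.1 * rng.randn(200)
df['GUST.mph'] = rng.randn(200)

corr = df.corr()  # pairwise Pearson correlations; NaNs excluded pairwise
print(corr.round(2))
```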


OK, some of the other variables seem strongly correlated with Load, interesting, what are these?


In [20]:
df2['HI6...F'].hist()


Out[20]:
<matplotlib.axes.AxesSubplot at 0x10cdc9810>

that's a lot of zeros, how many of these values are good?


In [21]:
df2['HI6...F'].map(lambda x: x > 0.0).value_counts()


Out[21]:
False    22466
True      3835
dtype: int64

OK nevermind, this is junk.
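
(quick pandas aside before moving on: the map/lambda isn't needed here, since comparisons broadcast over a whole Series. toy values:)

```python
import pandas as pd

# comparisons are vectorized, so `s > 0.0` replaces s.map(lambda x: x > 0.0)
s = pd.Series([0.0, 39.0, 0.0, 35.1])
good = s > 0.0
print(good.value_counts())
print(int(good.sum()))  # 2
```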

running low on time here, let's move on. how about we make this a daily timeseries, taking the mean of every numeric column


In [22]:
df_clean_daily = df_clean.resample('D', how=np.mean)
df_clean_daily.Temp.plot()


Out[22]:
<matplotlib.axes.AxesSubplot at 0x10c64a190>

In [23]:
df_clean_daily.head()


Out[23]:
ROW.ID Year Month Day Hour Load Temp RELH.. SKNT.mph GUST.mph ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
RTD.End.Time.Stamp
2011-01-01 12.5 2011 1 1 12.5 5269.729167 38.829167 79.875000 7.062500 0.000000 ... 995.416667 1093.166667 0.000 10333.333333 6.712500 5.920833 0 1.791667 1.379167 33.104167
2011-01-02 36.5 2011 1 2 12.5 5298.033333 42.041667 83.291667 7.258333 2.733333 ... 159.708333 278.958333 305.125 2775.000000 1.754167 1.662500 0 2.037500 1.500000 37.012500
2011-01-03 60.5 2011 1 3 12.5 6132.954167 32.241667 43.708333 16.658333 14.050000 ... 715.000000 0.000000 0.000 3125.000000 6.216667 5.129167 0 1.500000 1.166667 12.395833
2011-01-04 84.5 2011 1 4 12.5 6192.225000 33.670833 58.500000 9.466667 0.000000 ... 1399.916667 1798.000000 417.125 8541.666667 5.825000 5.075000 0 1.579167 1.166667 20.650000
2011-01-05 108.5 2011 1 5 12.5 6166.900000 34.279167 49.833333 11.333333 1.775000 ... 914.666667 1318.208333 208.500 9583.333333 6.287500 5.287500 0 1.662500 1.166667 16.945833

5 rows × 31 columns
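
(note for readers on newer pandas: the how= argument was later removed from resample, and the chained form below is the equivalent. synthetic hourly data:)

```python
import numpy as np
import pandas as pd

# two days of fake hourly observations
idx = pd.date_range('2011-01-01', periods=48, freq='h')
hourly = pd.DataFrame({'Load': np.arange(48, dtype=float)}, index=idx)

# modern equivalent of resample('D', how=np.mean)
daily = hourly.resample('D').mean()
print(daily['Load'].tolist())  # [11.5, 35.5]
```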

Daily Data : Scatter Plot w/ Histograms & Linear Model


In [24]:
color = sns.color_palette()[2]
g = sns.jointplot("Temp", "Load", data=df_clean_daily, kind="reg",
                  color=color, size=15)


woah, crazy boomerang!

we'll come back to this, maybe try a polynomial fit. in the meantime let's look at some more variables
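
(a sketch of what that polynomial fit could look like, using the statsmodels formula API on synthetic U-shaped data, not the meetup data:)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic U-shaped load-vs-temp relationship (made up for illustration)
rng = np.random.RandomState(0)
temp = rng.uniform(10, 95, 500)
load = 7000 - 60 * temp + 0.7 * temp ** 2 + rng.normal(0, 50, 500)
df = pd.DataFrame({'Temp': temp, 'Load': load})

# I(...) lets the quadratic term live inside the formula string
fitted = smf.ols("Load ~ Temp + I(Temp**2)", data=df).fit()
print(round(fitted.rsquared, 3))
```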


In [25]:
df_clean['RELH..'].hist()


Out[25]:
<matplotlib.axes.AxesSubplot at 0x10cf0ecd0>

In [26]:
df_clean['DWP..F'].hist()


Out[26]:
<matplotlib.axes.AxesSubplot at 0x10cdec390>

A detour through some of the example questions

create a day-of-week column for some boxplots


In [27]:
df_clean['dow'] = df_clean.index.map(lambda x: x.weekday())
df_clean.dow.describe()


Out[27]:
count    26065.000000
mean         3.000269
std          2.000566
min          0.000000
25%          1.000000
50%          3.000000
75%          5.000000
max          6.000000
dtype: float64
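
(an aside: the DatetimeIndex exposes .dayofweek directly, a vectorized equivalent of the map/lambda above. toy week, Monday == 0:)

```python
import pandas as pd

idx = pd.date_range('2013-01-01', periods=7, freq='D')  # Tue 2013-01-01 .. Mon 2013-01-07
df = pd.DataFrame({'Load': range(7)}, index=idx)

# vectorized equivalent of index.map(lambda x: x.weekday())
df['dow'] = df.index.dayofweek
print(df['dow'].tolist())  # [1, 2, 3, 4, 5, 6, 0]
```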

Load by Day-of-Week


In [28]:
df_clean.boxplot(column='Load', by='dow', figsize=(15,6))


Out[28]:
<matplotlib.axes.AxesSubplot at 0x10eb15810>

Load by Month


In [29]:
# boxplot returns the axes, so set the title there
ax = df_clean.boxplot(column='Load', by='Month', figsize=(15,6))
ax.set_title('Load by Month')

Load by Hour-of-Day (June &amp; July only)


In [30]:
df_clean[df_clean.Month.isin([6, 7])].boxplot(column='Load', by='Hour', figsize=(15,6))


Out[30]:
<matplotlib.axes.AxesSubplot at 0x10f3e0690>

back to some modeling

Here's a closer look at the linear regressions that seaborn plotted for us above


In [31]:
import statsmodels.formula.api as smf

In [32]:
import statsmodels.api as sm

Temp vs Load


In [33]:
model = smf.ols(formula = "Load ~ Temp", data = df_clean_daily)
fitted = model.fit()
fitted.summary()


Out[33]:
OLS Regression Results
Dep. Variable: Load R-squared: 0.408
Model: OLS Adj. R-squared: 0.407
Method: Least Squares F-statistic: 752.5
Date: Mon, 30 Jun 2014 Prob (F-statistic): 1.91e-126
Time: 21:29:43 Log-Likelihood: -8845.7
No. Observations: 1094 AIC: 1.770e+04
Df Residuals: 1092 BIC: 1.771e+04
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 3965.6361 85.485 46.390 0.000 3797.902 4133.370
Temp 40.4373 1.474 27.431 0.000 37.545 43.330
Omnibus: 35.251 Durbin-Watson: 0.412
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38.044
Skew: 0.456 Prob(JB): 5.48e-09
Kurtosis: 3.048 Cond. No. 209.

interesting, let's keep this one

Other variables

full disclosure: during the event i had a bogus correlation plot (on the 1 week of data), so i got distracted by some highly negative correlations with the following variables

Load vs PMSL


In [34]:
df_clean_daily['Humid'] = df_clean_daily['RELH..']
df_clean_daily['PMSL'] = df_clean_daily['PMSL.in']
model = smf.ols(formula = "Load ~ PMSL", data = df_clean_daily)
fitted = model.fit()
fitted.summary()


Out[34]:
OLS Regression Results
Dep. Variable: Load R-squared: 0.002
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 2.733
Date: Mon, 30 Jun 2014 Prob (F-statistic): 0.0986
Time: 21:29:46 Log-Likelihood: -9131.1
No. Observations: 1094 AIC: 1.827e+04
Df Residuals: 1092 BIC: 1.828e+04
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 8326.7166 1275.865 6.526 0.000 5823.292 1.08e+04
PMSL -70.3419 42.548 -1.653 0.099 -153.827 13.144
Omnibus: 215.024 Durbin-Watson: 0.246
Prob(Omnibus): 0.000 Jarque-Bera (JB): 363.404
Skew: 1.253 Prob(JB): 1.22e-79
Kurtosis: 4.300 Cond. No. 1.24e+03

Load vs Humidity


In [35]:
model = smf.ols(formula = "Load ~ Humid", data = df_clean_daily)
fitted = model.fit()
fitted.summary()


Out[35]:
OLS Regression Results
Dep. Variable: Load R-squared: 0.031
Model: OLS Adj. R-squared: 0.030
Method: Least Squares F-statistic: 35.27
Date: Mon, 30 Jun 2014 Prob (F-statistic): 3.85e-09
Time: 21:29:48 Log-Likelihood: -9115.1
No. Observations: 1094 AIC: 1.823e+04
Df Residuals: 1092 BIC: 1.824e+04
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 5473.7554 128.957 42.446 0.000 5220.723 5726.788
Humid 11.4073 1.921 5.939 0.000 7.639 15.176
Omnibus: 221.449 Durbin-Watson: 0.260
Prob(Omnibus): 0.000 Jarque-Bera (JB): 388.584
Skew: 1.255 Prob(JB): 4.17e-85
Kurtosis: 4.491 Cond. No. 285.

OK both of these are very weak, forget 'em

Load vs. Temp, by Month

with <10 min left, i had the idea of fitting a model by month

here we use pandas groupby to get a dataframe per month-of-year


In [36]:
grp = df_clean_daily.groupby('Month')
len(grp)


Out[36]:
12

and we create a jointplot per-month


In [37]:
models = {}
fitteds = {}

for (key, df_g) in grp:
    g = sns.jointplot("Temp", "Load", data=df_g, kind="reg",
                  color=color, size=15)
    g.ax_joint.set_title("Month == %s" % key)
    #models[key] = smf.ols(formula = "Load ~ Temp", data = df_g)
    #fitteds[key] = models[key].fit()
    #print fitteds[key].summary()
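
for completeness, here's roughly what the commented-out fit-per-month loop would look like, sketched on synthetic data (the slopes and values are made up; with the real data you'd pass each df_g from the groupby):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# two fake "months" whose Load responds to Temp with different slopes
rng = np.random.RandomState(1)
frames = []
for month, slope in [(1, -30.0), (7, 45.0)]:
    temp = rng.uniform(20, 90, 200)
    frames.append(pd.DataFrame({
        'Month': month,
        'Temp': temp,
        'Load': 5500 + slope * temp + rng.normal(0, 80, 200),
    }))
df = pd.concat(frames)

# one OLS fit per month, mirroring the commented-out loop above
fitteds = {month: smf.ols("Load ~ Temp", data=df_g).fit()
           for month, df_g in df.groupby('Month')}
for month, fit in sorted(fitteds.items()):
    print(month, round(fit.params['Temp'], 1))
```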