pull the data into pandas DataFrames
In [1]:
import pandas as pd
%pylab inline
df1 = pd.read_csv('./data/20140623/NYCEnergyData_dataset_062314_0.csv')
df2 = pd.read_csv('./data/20140623/NYCEnergyData_dataset_062314_NA.csv')
take a quick peek at the data
In [2]:
df1.head()
Out[2]:
In [3]:
df2.head()
Out[3]:
ah, Lucas indicates that _NA is full of NaN and _0 has been "cleaned"; let's stick with the raw data and let pandas help us out
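as a quick sanity check (a sketch, not something from the original session), pandas can show how much of each column is actually missing in the raw file:
# count the NaN values per column in the raw _NA file
df2.isnull().sum()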
use DataFrame.set_index to make this officially a timeseries :)
In [4]:
df2.set_index(df2['RTD.End.Time.Stamp'].map(lambda x: pd.Timestamp(x)), inplace=True)
df2.info()
how granular are these observations?
In [5]:
df2['Load.kW'].head(10)
Out[5]:
ok, roughly hourly. as we can see above, there are a few non-numeric columns; what are these?
In [6]:
df2.QFLG.value_counts()
Out[6]:
ok, Lucas says QFLG is a "data quality check", how about WNUM?
In [7]:
df2.WNUM.value_counts()
Out[7]:
looks fun, maybe we'll come back to this
rename some columns; this will mean less typing when indexing/plotting them
In [8]:
df2.rename(columns={'Load.kW':'Load', 'TEMP(F)':'Temp' }, inplace=True)
here i'm using seaborn, an awesome python visualization library built atop matplotlib
In [9]:
import seaborn as sns
df2.Load.hist(figsize=(15,6), bins=100)
Out[9]:
In [10]:
df2.Temp.hist(figsize=(15,6), bins=100)
Out[10]:
In [11]:
df2.Temp.plot(figsize=(15,6))
Out[11]:
so we see some variance, and a few crazy outliers, e.g. did it really get below 20F in early May 2013?
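a quick way to pull those suspect readings out would be something like this (a sketch, not run during the event; i'm assuming the dip is in the first ten days of May 2013):
# grab early-May 2013 temperatures and keep only the implausibly cold ones
cold_may = df2.Temp.loc['2013-05-01':'2013-05-10']
cold_may[cold_may < 20]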
In [12]:
df2.Load.plot(figsize=(15,6))
Out[12]:
we see a big drop in the fall of 2012.... This is likely Superstorm Sandy
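zooming in on late October 2012 would make that easy to confirm (a sketch, not from the original session; the date range here is my assumption about when Sandy hit):
# plot Load around Superstorm Sandy's landfall
df2.Load.loc['2012-10-25':'2012-11-15'].plot(figsize=(15,6))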
what were those columns again??
In [13]:
df2.info()
let's take a small slice of this data to make our tests fast
In [14]:
df_slice = df2[pd.Timestamp('20130101'):pd.Timestamp('20130108')]
df_slice.Load.plot()
Out[14]:
OK, good: we have just over a week of data, and the weekends are clearly lower-load
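to put a number on that weekend dip, something like this would do (a sketch, not run during the event):
# mean Load by day-of-week (0=Monday ... 6=Sunday) over the one-week slice
df_slice.groupby(df_slice.index.weekday).Load.mean()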
let's grab all of the numeric columns. we want everything after Day, except QFLG and WNUM. i'm going to just do this manually.
In [15]:
cols = df_slice.columns[8:]
cols
Out[15]:
In [16]:
cols = [u'Load', u'Temp', u'RELH..', u'SKNT.mph', u'GUST.mph', u'DRCT..', u'PMSL.in', u'ALTI.in', u'P03D.in', u'PCHA', u'VSBY.miles', u'P01I.in', u'P03I.in', u'P06I.in', u'P24I.in', u'SNOW.in', u'CHC1', u'CHC2', u'CHC3', u'CIG.feet', u'HI6...F', u'LO6...F', u'PEAK.mph', u'HI24...F', u'LO24...F', u'DWP..F']
In [17]:
df_clean_slice = df_slice.query("QFLG == 'OK'")[cols]
df_clean_slice.head()
Out[17]:
back to the whole dataset; let's filter on that QFLG column
In [18]:
df_clean = df2.query("QFLG == 'OK'").copy()
generate the correlation matrix plot using seaborn's corrplot
http://web.stanford.edu/~mwaskom/software/seaborn/examples/many_pairwise_correlations.html
In [19]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(9, 9))
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"], as_cmap=True)
sns.corrplot(df_clean[cols], annot=False, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()
OK, some of the other variables seem strongly correlated with Load. interesting, what are these?
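a quicker way to rank them than squinting at the heatmap would be something like this (a sketch, not from the original session; assumes the columns in cols are all numeric enough for corr()):
# absolute correlation of every other column with Load, largest first
df_clean[cols].corr()['Load'].drop('Load').abs().sort_values(ascending=False).head(10)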
In [20]:
df2['HI6...F'].hist()
Out[20]:
that's a lot of zeros; how many of these values are good?
In [21]:
df2['HI6...F'].map(lambda x: x > 0.0).value_counts()
Out[21]:
OK, never mind, this is junk.
running low on time here, let's move on: how about we make this a daily timeseries, taking the mean of every numeric column
In [22]:
df_clean_daily = df_clean.resample('D', how=np.mean)
df_clean_daily.Temp.plot()
Out[22]:
In [23]:
df_clean_daily.head()
Out[23]:
here we use seaborn's jointplot
http://web.stanford.edu/~mwaskom/software/seaborn/examples/regression_marginals.html
In [24]:
color = sns.color_palette()[2]
g = sns.jointplot("Temp", "Load", data=df_clean_daily, kind="reg",
                  color=color, size=15)
we'll come back to this and maybe try a polynomial fit; in the meantime let's look at some more variables
In [25]:
df_clean['RELH..'].hist()
Out[25]:
In [26]:
df_clean['DWP..F'].hist()
Out[26]:
create a day-of-week column for some boxplots
In [27]:
df_clean['dow'] = df_clean.index.map(lambda x: x.weekday())
df_clean.dow.describe()
Out[27]:
In [28]:
df_clean.boxplot(column='Load', by='dow', figsize=(15,6))
Out[28]:
In [29]:
import matplotlib.pyplot as plt
df_clean.boxplot(column='Load', by='Month', figsize=(15,6))
plt.title('Load by Month')
In [30]:
df_clean[df_clean.Month.map(lambda x: x in range(6,8))].boxplot(column='Load', by='Hour', figsize=(15,6))
Out[30]:
Here's a closer look at the linear regressions that seaborn plotted for us above
In [31]:
import statsmodels.formula.api as smf
In [32]:
import statsmodels.api as sm
Temp vs Load
In [33]:
model = smf.ols(formula = "Load ~ Temp", data = df_clean_daily)
fitted = model.fit()
fitted.summary()
Out[33]:
interesting, let's keep this one
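for completeness, here's a sketch (not part of the original session) of how one could pull the fit numbers out and predict a couple of points with it:
# slope/intercept, R^2, and predicted daily Load at 60F and 90F
print(fitted.params)
print(fitted.rsquared)
fitted.predict(pd.DataFrame({'Temp': [60, 90]}))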
full disclosure: during the event i had a bogus correlation plot (on the 1 week of data), so i got distracted by some highly negative correlations with the following variables
In [34]:
df_clean_daily['Humid'] = df_clean_daily['RELH..']
df_clean_daily['PMSL'] = df_clean_daily['PMSL.in']
model = smf.ols(formula = "Load ~ PMSL", data = df_clean_daily)
fitted = model.fit()
fitted.summary()
Out[34]:
In [35]:
model = smf.ols(formula = "Load ~ Humid", data = df_clean_daily)
fitted = model.fit()
fitted.summary()
Out[35]:
OK both of these are very weak, forget 'em
with <10 min left, i had the idea of fitting a model by month
here we use pandas groupby to get a dataframe per month-of-year
In [36]:
grp = df_clean_daily.groupby('Month')
len(grp)
Out[36]:
and we create a jointplot per-month
In [37]:
models = {}
fitteds = {}
for (key, df_g) in grp:
    g = sns.jointplot("Temp", "Load", data=df_g, kind="reg",
                      color=color, size=15)
    g.ax_joint.set_title("Month == %s" % key)
    #models[key] = smf.ols(formula = "Load ~ Temp", data = df_g)
    #fitteds[key] = models[key].fit()
    #print fitteds[key].summary()
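for the record, here's a sketch of what those commented-out lines were after: fit Load ~ Temp per month and collect the Temp coefficients (not run during the event):
# per-month OLS slope of Load on Temp
slopes = {}
for key, df_g in grp:
    res = smf.ols(formula="Load ~ Temp", data=df_g).fit()
    slopes[key] = res.params['Temp']
pd.Series(slopes)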
while we were discussing the entries, i tested the assertion that Weekend/Weekday would explain the bimodal Load distribution
In [38]:
df_clean_daily['is_weekend'] = df_clean_daily.Day.map(lambda x: x in [5,6])
g = sns.lmplot("Temp", "Load", df_clean_daily, hue="is_weekend",
               size=15)
i guess not... maybe it is just seasonal differences between daytime & nighttime temps
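one way to eyeball that would be to compare the hourly temperature profile in summer vs winter (a sketch, not from the original session; assumes Month and Hour are month-of-year and hour-of-day, as used in the boxplots above):
# mean Temp by hour-of-day for summer vs winter months
summer = df_clean[df_clean.Month.isin([6, 7, 8])]
winter = df_clean[df_clean.Month.isin([12, 1, 2])]
summer.groupby('Hour').Temp.mean().plot(figsize=(15,6), label='summer', legend=True)
winter.groupby('Hour').Temp.mean().plot(label='winter', legend=True)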
it turns out that we can show the fit on a single lmplot; the colors here are such that the summer months are actually the "cool" colors...
In [39]:
g = sns.lmplot("Temp", "Load", df_clean_daily, hue="Month", size=15)
In [51]:
df_clean_daily.dropna(subset=['Temp','Load'], inplace=True)
# fit a 2nd-degree polynomial of daily Load on Temp and overlay it on the scatter
p2d = np.poly1d(np.polyfit(df_clean_daily.Temp, df_clean_daily.Load, 2))
xp = np.linspace(10, 100, 90)
plt.plot(df_clean_daily.Temp, df_clean_daily.Load, '.', alpha=0.5)
plt.plot(xp, p2d(xp), '-')
plt.show()
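as a rough check on whether the quadratic actually buys us anything over the straight line, here's a sketch (not part of the original session) computing R^2 for both fits:
# linear fit R^2 from statsmodels vs. quadratic R^2 computed by hand
lin = smf.ols(formula="Load ~ Temp", data=df_clean_daily).fit()
resid = df_clean_daily.Load - p2d(df_clean_daily.Temp)
r2_quad = 1 - (resid ** 2).sum() / ((df_clean_daily.Load - df_clean_daily.Load.mean()) ** 2).sum()
print(lin.rsquared, r2_quad)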