NYC Energy Data Meetup : #LiveAnalytic

hello spectators & judges!

pull the data into pandas DataFrames


In [1]:
import pandas as pd
%pylab inline
df1 = pd.read_csv('./data/20140623/NYCEnergyData_dataset_062314_0.csv')
df2 = pd.read_csv('./data/20140623/NYCEnergyData_dataset_062314_NA.csv')


Populating the interactive namespace from numpy and matplotlib

take a quick peek at the data


In [2]:
df1.head()


Out[2]:
ROW.ID RTD.End.Time.Stamp HDM DOW Year Month Day Hour Load.kW TEMP.F ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
0 1 1/1/2011 0:05 0_1_1 Sat 2011 1 1 1 5267.6 35.1 ... 1 0 0 0 39 35.1 0 0 0 28.0
1 2 1/1/2011 1:00 1_1_1 Sat 2011 1 1 2 5132.2 33.1 ... 1 0 0 0 0 0.0 0 0 0 27.0
2 3 1/1/2011 2:00 2_1_1 Sat 2011 1 1 3 4905.6 34.0 ... 1 0 0 0 0 0.0 0 0 0 28.0
3 4 1/1/2011 3:00 3_1_1 Sat 2011 1 1 4 4716.8 35.1 ... 1 0 0 0 0 0.0 0 0 0 28.9
4 5 1/1/2011 4:00 4_1_1 Sat 2011 1 1 5 4594.8 35.1 ... 2506 0 0 0 0 0.0 0 0 0 28.0

5 rows × 36 columns


In [3]:
df2.head()


Out[3]:
ROW.ID RTD.End.Time.Stamp HDM DOW Year Month Day Hour Load.kW TEMP(F) ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
0 1 1/1/2011 0:05 0_1_1 Sat 2011 1 1 1 5267.6 35.1 ... 1 0 0 0 39 35.1 0 0 0 28.0
1 2 1/1/2011 1:00 1_1_1 Sat 2011 1 1 2 5132.2 33.1 ... 1 0 0 0 0 0.0 0 0 0 27.0
2 3 1/1/2011 2:00 2_1_1 Sat 2011 1 1 3 4905.6 34.0 ... 1 0 0 0 0 0.0 0 0 0 28.0
3 4 1/1/2011 3:00 3_1_1 Sat 2011 1 1 4 4716.8 35.1 ... 1 0 0 0 0 0.0 0 0 0 28.9
4 5 1/1/2011 4:00 4_1_1 Sat 2011 1 1 5 4594.8 35.1 ... 2506 0 0 0 0 0.0 0 0 0 28.0

5 rows × 36 columns

ah, Lucas indicates that _NA is full of NaN, and _0 has been "cleaned"; let's stick with the raw data and let pandas help us out

use DataFrame.set_index to make this officially a timeseries :)


In [4]:
df2.set_index(df2['RTD.End.Time.Stamp'].map(lambda x: pd.Timestamp(x)), inplace=True)
df2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26301 entries, 2011-01-01 00:05:00 to 2013-12-31 23:00:00
Data columns (total 36 columns):
ROW.ID                26301 non-null int64
RTD.End.Time.Stamp    26301 non-null object
HDM                   26301 non-null object
DOW                   26301 non-null object
Year                  26301 non-null int64
Month                 26301 non-null int64
Day                   26301 non-null int64
Hour                  26301 non-null int64
Load.kW               26277 non-null float64
TEMP(F)               26080 non-null float64
RELH..                26081 non-null float64
SKNT.mph              26078 non-null float64
GUST.mph              4770 non-null float64
DRCT..                24697 non-null float64
QFLG                  26076 non-null object
PMSL.in               23299 non-null float64
ALTI.in               26106 non-null float64
P03D.in               7944 non-null float64
PCHA                  279 non-null float64
WNUM                  26145 non-null object
VSBY.miles            26104 non-null float64
P01I.in               3895 non-null float64
P03I.in               894 non-null float64
P06I.in               1105 non-null float64
P24I.in               693 non-null float64
SNOW.in               536 non-null float64
CHC1                  25861 non-null float64
CHC2                  15533 non-null float64
CHC3                  7501 non-null float64
CIG.feet              15544 non-null float64
HI6...F               4188 non-null float64
LO6...F               4188 non-null float64
PEAK.mph              1380 non-null float64
HI24...F              1378 non-null float64
LO24...F              1378 non-null float64
DWP..F                26080 non-null float64
dtypes: float64(26), int64(5), object(5)
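
a side note while the info() is fresh: read_csv can parse the timestamps itself via parse_dates/index_col, skipping the map/lambda entirely. a minimal sketch on made-up rows with the same column names:

```python
import io

import pandas as pd

# a tiny stand-in for the meetup CSV (hypothetical rows, same column names)
csv = io.StringIO(
    "ROW.ID,RTD.End.Time.Stamp,Load.kW\n"
    "1,1/1/2011 0:05,5267.6\n"
    "2,1/1/2011 1:00,5132.2\n"
)

# parse the stamp column as dates and make it the index in one call
df = pd.read_csv(csv, parse_dates=['RTD.End.Time.Stamp'],
                 index_col='RTD.End.Time.Stamp')
print(type(df.index).__name__)  # DatetimeIndex
```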

how granular are these observations?


In [5]:
df2['Load.kW'].head(10)


Out[5]:
RTD.End.Time.Stamp
2011-01-01 00:05:00    5267.6
2011-01-01 01:00:00    5132.2
2011-01-01 02:00:00    4905.6
2011-01-01 03:00:00    4716.8
2011-01-01 04:00:00    4594.8
2011-01-01 05:00:00    4577.3
2011-01-01 06:00:00    4715.9
2011-01-01 07:00:00    4699.9
2011-01-01 08:00:00    4844.1
2011-01-01 09:00:00    5021.7
Name: Load.kW, dtype: float64

ok, roughly hourly. and as the info() output above shows, there are a few non-numeric (object) columns; what are these?


In [6]:
df2.QFLG.value_counts()


Out[6]:
OK         26065
Caution       11
dtype: int64

ok, Lucas says QFLG is a "data quality check", how about WNUM?


In [7]:
df2.WNUM.value_counts()


Out[7]:
mostly cloudy      7963
partly cloudy      5191
overcast           3862
mostly clear       3191
clear              1902
fog                1232
lt rain             970
lt rain; fog        505
lt drizzle; fog     207
mod rain; fog       194
lt snow             182
haze                163
lt snow; fog        115
hvy rain; fog        88
mod rain             48
...
; mod ice pellet; fog             1
squalls                           1
mod thunder shwr; squalls         1
; mod snow pellet; mod rain       1
; mod rain; fog                   1
; blowing snow; ice fog           1
thunder; squalls                  1
lt ice pellets                    1
lt snow; lt ice pellets           1
; fog; blowing snow               1
lt snow; mod ice pellet           1
lt frz rain; mod ice pellet       1
; lt snow; lt ice pellets         1
lt rain; mod snow                 1
lt frz drizzle; lt ice pellets    1
Length: 65, dtype: int64

looks fun, maybe we'll come back to this
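
(if we do come back to it, those semicolon-separated conditions pull apart nicely with vectorized string methods. a sketch on a few made-up WNUM-style values:)

```python
import pandas as pd

# a few made-up WNUM-style strings (semicolon-separated weather conditions)
wnum = pd.Series(['lt rain; fog', 'mostly cloudy', 'lt snow; fog', 'clear'])

# one boolean indicator column per condition, via split + get_dummies
flags = wnum.str.get_dummies(sep='; ')
print(flags['fog'].tolist())  # [1, 0, 1, 0]
```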

rename some columns; this will mean less typing when indexing/plotting these


In [8]:
df2.rename(columns={'Load.kW':'Load', 'TEMP(F)':'Temp' }, inplace=True)

Histogram of Load values

here i am using an awesome python visualization library built atop matplotlib called seaborn

http://web.stanford.edu/~mwaskom/software/seaborn/


In [9]:
import seaborn as sns
df2.Load.hist(figsize=(15,6), bins=100)


Out[9]:
<matplotlib.axes.AxesSubplot at 0x10bc96f50>

Histogram of Temperature values


In [10]:
df2.Temp.hist(figsize=(15,6), bins=100)


Out[10]:
<matplotlib.axes.AxesSubplot at 0x107736650>

Timeseries plot of Temperature


In [11]:
df2.Temp.plot(figsize=(15,6))


Out[11]:
<matplotlib.axes.AxesSubplot at 0x10bd0fed0>

so we see some variance, and a few crazy outliers, e.g. did it really get below 20F in early May 2013?

Timeseries plot of Load


In [12]:
df2.Load.plot(figsize=(15,6))


Out[12]:
<matplotlib.axes.AxesSubplot at 0x10c23d350>

we see a big drop in the fall of 2012... this is likely Superstorm Sandy

Let's look at the predictive value of our weather data vs Load

what were those columns again??


In [13]:
df2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26301 entries, 2011-01-01 00:05:00 to 2013-12-31 23:00:00
Data columns (total 36 columns):
ROW.ID                26301 non-null int64
RTD.End.Time.Stamp    26301 non-null object
HDM                   26301 non-null object
DOW                   26301 non-null object
Year                  26301 non-null int64
Month                 26301 non-null int64
Day                   26301 non-null int64
Hour                  26301 non-null int64
Load                  26277 non-null float64
Temp                  26080 non-null float64
RELH..                26081 non-null float64
SKNT.mph              26078 non-null float64
GUST.mph              4770 non-null float64
DRCT..                24697 non-null float64
QFLG                  26076 non-null object
PMSL.in               23299 non-null float64
ALTI.in               26106 non-null float64
P03D.in               7944 non-null float64
PCHA                  279 non-null float64
WNUM                  26145 non-null object
VSBY.miles            26104 non-null float64
P01I.in               3895 non-null float64
P03I.in               894 non-null float64
P06I.in               1105 non-null float64
P24I.in               693 non-null float64
SNOW.in               536 non-null float64
CHC1                  25861 non-null float64
CHC2                  15533 non-null float64
CHC3                  7501 non-null float64
CIG.feet              15544 non-null float64
HI6...F               4188 non-null float64
LO6...F               4188 non-null float64
PEAK.mph              1380 non-null float64
HI24...F              1378 non-null float64
LO24...F              1378 non-null float64
DWP..F                26080 non-null float64
dtypes: float64(26), int64(5), object(5)

let's take a small slice of this data to make our tests fast


In [14]:
df_slice = df2[pd.Timestamp('20130101'):pd.Timestamp('20130108')]
df_slice.Load.plot()


Out[14]:
<matplotlib.axes.AxesSubplot at 0x10c6483d0>

OK, good we have just over a week of data, and the weekends are clearly lower-load
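
(why "just over" a week: label-based slicing on a DatetimeIndex includes both endpoints, so Jan 1 through Jan 8 is 8 days. a quick sketch on toy daily data:)

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', '2013-01-31', freq='D')
df = pd.DataFrame({'Load': np.arange(len(idx), dtype=float)}, index=idx)

# label-based slicing on a DatetimeIndex includes BOTH endpoints
week = df[pd.Timestamp('20130101'):pd.Timestamp('20130108')]
print(len(week))  # 8

# partial-string indexing is an even shorter spelling of the same slice
print(len(df.loc['2013-01-01':'2013-01-08']))  # 8
```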

let's grab all of the numeric columns. we want everything after Hour (i.e. Load onward), except QFLG and WNUM. i'm going to just do this manually.


In [15]:
cols = df_slice.columns[8:]
cols


Out[15]:
Index([u'Load', u'Temp', u'RELH..', u'SKNT.mph', u'GUST.mph', u'DRCT..', u'QFLG', u'PMSL.in', u'ALTI.in', u'P03D.in', u'PCHA', u'WNUM', u'VSBY.miles', u'P01I.in', u'P03I.in', u'P06I.in', u'P24I.in', u'SNOW.in', u'CHC1', u'CHC2', u'CHC3', u'CIG.feet', u'HI6...F', u'LO6...F', u'PEAK.mph', u'HI24...F', u'LO24...F', u'DWP..F'], dtype='object')

In [16]:
cols = [u'Load', u'Temp', u'RELH..', u'SKNT.mph', u'GUST.mph', u'DRCT..', u'PMSL.in', u'ALTI.in', u'P03D.in', u'PCHA', u'VSBY.miles', u'P01I.in', u'P03I.in', u'P06I.in', u'P24I.in', u'SNOW.in', u'CHC1', u'CHC2', u'CHC3', u'CIG.feet', u'HI6...F', u'LO6...F', u'PEAK.mph', u'HI24...F', u'LO24...F', u'DWP..F']

In [17]:
df_clean_slice = df_slice.query("QFLG == 'OK'")[cols]
df_clean_slice.head()


Out[17]:
Load Temp RELH.. SKNT.mph GUST.mph DRCT.. PMSL.in ALTI.in P03D.in PCHA ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
RTD.End.Time.Stamp
2013-01-01 00:05:00 5214.3 39.9 57 12.7 NaN 250 29.89 29.89 236.48 NaN ... 654 NaN NaN 6500 39.9 37.9 NaN NaN NaN 26.1
2013-01-01 01:00:00 5045.7 39.0 59 12.7 NaN 260 29.90 29.90 NaN NaN ... 604 NaN NaN 6000 NaN NaN NaN NaN NaN 26.1
2013-01-01 02:00:00 4831.3 39.0 59 11.5 NaN 270 29.90 29.90 NaN NaN ... 463 803 2504 4600 NaN NaN NaN NaN NaN 26.1
2013-01-01 03:00:00 4640.5 39.9 59 15.0 NaN 260 29.91 29.91 29.68 NaN ... 464 NaN NaN 4600 NaN NaN NaN NaN NaN 27.0
2013-01-01 04:00:00 4501.7 39.9 62 17.3 NaN 250 29.90 29.90 NaN NaN ... 463 554 NaN 4600 NaN NaN NaN NaN NaN 28.0

5 rows × 26 columns

Correlation matrix

back to the whole dataset, let's try this QFLG column


In [18]:
df_clean = df2.query("QFLG == 'OK'").copy()
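
(.query is just a readable spelling of a boolean mask; the two selections below are identical. toy data:)

```python
import pandas as pd

df = pd.DataFrame({'QFLG': ['OK', 'Caution', 'OK'], 'Load': [5.0, 6.0, 7.0]})

# .query() compiles the expression string down to the same boolean-mask selection
a = df.query("QFLG == 'OK'")
b = df[df['QFLG'] == 'OK']
print(a.equals(b))  # True
```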

generate the correlation matrix plot using seaborn's corrplot

http://web.stanford.edu/~mwaskom/software/seaborn/examples/many_pairwise_correlations.html


In [19]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")

f, ax = plt.subplots(figsize=(9, 9))
cmap = sns.blend_palette(["#00008B", "#6A5ACD", "#F0F8FF",
                          "#FFE6F8", "#C71585", "#8B0000"], as_cmap=True)
sns.corrplot(df_clean[cols], annot=False, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()
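
(under the hood that plot is mostly just pandas: `df.corr()` computes the pairwise Pearson correlations, excluding NaNs pairwise. note for future readers: `corrplot` was later removed from seaborn, and `sns.heatmap(df.corr())` is the modern route. a sketch on synthetic columns, not the meetup data:)

```python
import numpy as np
import pandas as pd

# synthetic frame: one strong relationship, one pure-noise column
rng = np.random.RandomState(0)
df = pd.DataFrame({'Load': rng.randn(200)})
df['Temp'] = 0.9 * df['Load'] + 0.1 * rng.randn(200)
df['GUST.mph'] = rng.randn(200)

corr = df.corr()  # pairwise Pearson correlations; NaNs excluded pairwise
print(corr.round(2))
```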


OK, some of the other variables seem strongly correlated with Load, interesting, what are these?


In [20]:
df2['HI6...F'].hist()


Out[20]:
<matplotlib.axes.AxesSubplot at 0x10cdc9810>

that's a lot of zeros, how many of these values are good?


In [21]:
df2['HI6...F'].map(lambda x: x > 0.0).value_counts()


Out[21]:
False    22466
True      3835
dtype: int64

OK nevermind, this is junk.
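
(quick pandas aside before moving on: the map/lambda isn't needed here, since comparisons broadcast over a whole Series. toy values:)

```python
import pandas as pd

# comparisons are vectorized, so `s > 0.0` replaces s.map(lambda x: x > 0.0)
s = pd.Series([0.0, 39.0, 0.0, 35.1])
good = s > 0.0
print(good.value_counts())
print(int(good.sum()))  # 2
```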

running low on time here, let's move on. how about we make this a daily timeseries, taking the mean of every numeric column


In [22]:
df_clean_daily = df_clean.resample('D', how=np.mean)
df_clean_daily.Temp.plot()


Out[22]:
<matplotlib.axes.AxesSubplot at 0x10c64a190>

In [23]:
df_clean_daily.head()


Out[23]:
ROW.ID Year Month Day Hour Load Temp RELH.. SKNT.mph GUST.mph ... CHC1 CHC2 CHC3 CIG.feet HI6...F LO6...F PEAK.mph HI24...F LO24...F DWP..F
RTD.End.Time.Stamp
2011-01-01 12.5 2011 1 1 12.5 5269.729167 38.829167 79.875000 7.062500 0.000000 ... 995.416667 1093.166667 0.000 10333.333333 6.712500 5.920833 0 1.791667 1.379167 33.104167
2011-01-02 36.5 2011 1 2 12.5 5298.033333 42.041667 83.291667 7.258333 2.733333 ... 159.708333 278.958333 305.125 2775.000000 1.754167 1.662500 0 2.037500 1.500000 37.012500
2011-01-03 60.5 2011 1 3 12.5 6132.954167 32.241667 43.708333 16.658333 14.050000 ... 715.000000 0.000000 0.000 3125.000000 6.216667 5.129167 0 1.500000 1.166667 12.395833
2011-01-04 84.5 2011 1 4 12.5 6192.225000 33.670833 58.500000 9.466667 0.000000 ... 1399.916667 1798.000000 417.125 8541.666667 5.825000 5.075000 0 1.579167 1.166667 20.650000
2011-01-05 108.5 2011 1 5 12.5 6166.900000 34.279167 49.833333 11.333333 1.775000 ... 914.666667 1318.208333 208.500 9583.333333 6.287500 5.287500 0 1.662500 1.166667 16.945833

5 rows × 31 columns
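
(note for readers on newer pandas: the how= argument was later removed from resample, and the chained form below is the equivalent. synthetic hourly data:)

```python
import numpy as np
import pandas as pd

# two days of fake hourly observations
idx = pd.date_range('2011-01-01', periods=48, freq='h')
hourly = pd.DataFrame({'Load': np.arange(48, dtype=float)}, index=idx)

# modern equivalent of resample('D', how=np.mean)
daily = hourly.resample('D').mean()
print(daily['Load'].tolist())  # [11.5, 35.5]
```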

Daily Data : Scatter Plot w/ Histograms & Linear Model


In [24]:
color = sns.color_palette()[2]
g = sns.jointplot("Temp", "Load", data=df_clean_daily, kind="reg",
                  color=color, size=15)


woah, crazy boomerang!

we'll come back to this, maybe try a polynomial fit. in the meantime let's look at some more variables
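
(a sketch of what that polynomial fit could look like, using the statsmodels formula API on synthetic U-shaped data, not the meetup data:)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic U-shaped load-vs-temp relationship (made up for illustration)
rng = np.random.RandomState(0)
temp = rng.uniform(10, 95, 500)
load = 7000 - 60 * temp + 0.7 * temp ** 2 + rng.normal(0, 50, 500)
df = pd.DataFrame({'Temp': temp, 'Load': load})

# I(...) lets the quadratic term live inside the formula string
fitted = smf.ols("Load ~ Temp + I(Temp**2)", data=df).fit()
print(round(fitted.rsquared, 3))
```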


In [25]:
df_clean['RELH..'].hist()


Out[25]:
<matplotlib.axes.AxesSubplot at 0x10cf0ecd0>

In [26]:
df_clean['DWP..F'].hist()


Out[26]:
<matplotlib.axes.AxesSubplot at 0x10cdec390>

A detour through some of the example questions

create a day-of-week column for some boxplots


In [27]:
df_clean['dow'] = df_clean.index.map(lambda x: x.weekday())
df_clean.dow.describe()


Out[27]:
count    26065.000000
mean         3.000269
std          2.000566
min          0.000000
25%          1.000000
50%          3.000000
75%          5.000000
max          6.000000
dtype: float64
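
(an aside: the DatetimeIndex exposes .dayofweek directly, a vectorized equivalent of the map/lambda above. toy week, Monday == 0:)

```python
import pandas as pd

idx = pd.date_range('2013-01-01', periods=7, freq='D')  # Tue 2013-01-01 .. Mon 2013-01-07
df = pd.DataFrame({'Load': range(7)}, index=idx)

# vectorized equivalent of index.map(lambda x: x.weekday())
df['dow'] = df.index.dayofweek
print(df['dow'].tolist())  # [1, 2, 3, 4, 5, 6, 0]
```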

Load by Day-of-Week


In [28]:
df_clean.boxplot(column='Load', by='dow', figsize=(15,6))


Out[28]:
<matplotlib.axes.AxesSubplot at 0x10eb15810>

Load by Month


In [29]:
# boxplot returns the axes, so set the title there
ax = df_clean.boxplot(column='Load', by='Month', figsize=(15,6))
ax.set_title('Load by Month')

Load by Hour-of-Day (June &amp; July only)


In [30]:
df_clean[df_clean.Month.isin([6, 7])].boxplot(column='Load', by='Hour', figsize=(15,6))


Out[30]:
<matplotlib.axes.AxesSubplot at 0x10f3e0690>

back to some modeling

Here's a closer look at the linear regressions that seaborn plotted for us above


In [31]:
import statsmodels.formula.api as smf

In [32]:
import statsmodels.api as sm

Temp vs Load


In [33]:
model = smf.ols(formula = "Load ~ Temp", data = df_clean_daily)
fitted = model.fit()
fitted.summary()


Out[33]:
OLS Regression Results
Dep. Variable: Load R-squared: 0.408
Model: OLS Adj. R-squared: 0.407
Method: Least Squares F-statistic: 752.5
Date: Mon, 30 Jun 2014 Prob (F-statistic): 1.91e-126
Time: 21:29:43 Log-Likelihood: -8845.7
No. Observations: 1094 AIC: 1.770e+04
Df Residuals: 1092 BIC: 1.771e+04
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 3965.6361 85.485 46.390 0.000 3797.902 4133.370
Temp 40.4373 1.474 27.431 0.000 37.545 43.330
Omnibus: 35.251 Durbin-Watson: 0.412
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38.044
Skew: 0.456 Prob(JB): 5.48e-09
Kurtosis: 3.048 Cond. No. 209.

interesting, let's keep this one

Other variables

full disclosure: during the event i had a bogus correlation plot (on the 1 week of data), so i got distracted by some highly negative correlations with the following variables

Load vs PMSL


In [34]:
df_clean_daily['Humid'] = df_clean_daily['RELH..']
df_clean_daily['PMSL'] = df_clean_daily['PMSL.in']
model = smf.ols(formula = "Load ~ PMSL", data = df_clean_daily)
fitted = model.fit()
fitted.summary()


Out[34]:
OLS Regression Results
Dep. Variable: Load R-squared: 0.002
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 2.733
Date: Mon, 30 Jun 2014 Prob (F-statistic): 0.0986
Time: 21:29:46 Log-Likelihood: -9131.1
No. Observations: 1094 AIC: 1.827e+04
Df Residuals: 1092 BIC: 1.828e+04
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 8326.7166 1275.865 6.526 0.000 5823.292 1.08e+04
PMSL -70.3419 42.548 -1.653 0.099 -153.827 13.144
Omnibus: 215.024 Durbin-Watson: 0.246
Prob(Omnibus): 0.000 Jarque-Bera (JB): 363.404
Skew: 1.253 Prob(JB): 1.22e-79
Kurtosis: 4.300 Cond. No. 1.24e+03

Load vs Humidity


In [35]:
model = smf.ols(formula = "Load ~ Humid", data = df_clean_daily)
fitted = model.fit()
fitted.summary()


Out[35]:
OLS Regression Results
Dep. Variable: Load R-squared: 0.031
Model: OLS Adj. R-squared: 0.030
Method: Least Squares F-statistic: 35.27
Date: Mon, 30 Jun 2014 Prob (F-statistic): 3.85e-09
Time: 21:29:48 Log-Likelihood: -9115.1
No. Observations: 1094 AIC: 1.823e+04
Df Residuals: 1092 BIC: 1.824e+04
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 5473.7554 128.957 42.446 0.000 5220.723 5726.788
Humid 11.4073 1.921 5.939 0.000 7.639 15.176
Omnibus: 221.449 Durbin-Watson: 0.260
Prob(Omnibus): 0.000 Jarque-Bera (JB): 388.584
Skew: 1.255 Prob(JB): 4.17e-85
Kurtosis: 4.491 Cond. No. 285.

OK both of these are very weak, forget 'em

Load vs. Temp, by Month

with <10 min left, i had the idea of fitting a model by month

here we use pandas groupby to get a dataframe per month-of-year


In [36]:
grp = df_clean_daily.groupby('Month')
len(grp)


Out[36]:
12

and we create a jointplot per-month


In [37]:
models = {}
fitteds = {}

for (key, df_g) in grp:
    g = sns.jointplot("Temp", "Load", data=df_g, kind="reg",
                  color=color, size=15)
    g.ax_joint.set_title("Month == %s" % key)
    #models[key] = smf.ols(formula = "Load ~ Temp", data = df_g)
    #fitteds[key] = models[key].fit()
    #print fitteds[key].summary()
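
for completeness, here's roughly what the commented-out fit-per-month loop would look like, sketched on synthetic data (the slopes and values are made up; with the real data you'd pass each df_g from the groupby):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# two fake "months" whose Load responds to Temp with different slopes
rng = np.random.RandomState(1)
frames = []
for month, slope in [(1, -30.0), (7, 45.0)]:
    temp = rng.uniform(20, 90, 200)
    frames.append(pd.DataFrame({
        'Month': month,
        'Temp': temp,
        'Load': 5500 + slope * temp + rng.normal(0, 80, 200),
    }))
df = pd.concat(frames)

# one OLS fit per month, mirroring the commented-out loop above
fitteds = {month: smf.ols("Load ~ Temp", data=df_g).fit()
           for month, df_g in df.groupby('Month')}
for month, fit in sorted(fitteds.items()):
    print(month, round(fit.params['Temp'], 1))
```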