Playing with Data from the Residential Sector Energy Consumption Estimates, 1949–2012
Perry Grossman, 2015
Source: http://www.eia.gov/totalenergy/data/annual/#consumption
I downloaded this file:
http://www.eia.gov/totalenergy/data/browser/csv.cfm?tbl=T02.02
In [1]:
#Importing needed libraries:
import pandas as pd
from __future__ import division
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
# If desired, for more plotting:
# import seaborn as sns
In [2]:
# Important to know which version is being used to know the available functionality.
pd.__version__
Out[2]:
In [5]:
# Manually cleaned the data in LibreOffice; now using that data.
Res_Energy_Consumption_1949_2012 = pd.read_csv('MER_T02_02-10-30-2015 _v1.csv')
# Original Data: Raw_Res_Energy_Consumption_1949_2012 = pd.read_csv('MER_T02_02.csv')
# "Residential Sector Energy Consumption Estimates, 1949–2012"
# Source: http://www.eia.gov/totalenergy/data/annual/#consumption
In [6]:
# Top of the file
Res_Energy_Consumption_1949_2012.head(10)
Out[6]:
In [7]:
# The bottom of the file
Res_Energy_Consumption_1949_2012.tail(10)
# Note the monthly data and the yearly summary rows (MM = 13)
Out[7]:
In [9]:
#Use "NaN" for missing data for the Value field.
Res_Energy_Consumption_1949_2012.Value.replace({'Value': "Not Available"}, "NaN")
# Data is in Trillion Btus
# Python Reference: http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-missing-data
Out[9]:
In [10]:
# Get the sum of the Value field.
Res_Energy_Consumption_1949_2012.Value.sum()
Out[10]:
In [11]:
# Show a cumulative sum
Res_Energy_Consumption_1949_2012.Value.cumsum()
Out[11]:
In [12]:
# Use only yearly data
# MM = 13.
Res_Energy_Consumption_yr = Res_Energy_Consumption_1949_2012[Res_Energy_Consumption_1949_2012.MM == 13]
In [14]:
# Use the pivot table function to show the summed Value for each description.
report = Res_Energy_Consumption_yr.pivot_table(values='Value', index='Description', aggfunc=np.sum)
report
# For just the top 10:
#report.head(n=10)
Out[14]:
Findings:
974,079 Trillion Btu of total energy was consumed by the Residential Sector from 1949 to 2012.
408,070 Trillion Btu of fossil fuels was consumed by the Residential Sector from 1949 to 2012.
Roughly two-fifths (about 42%) of US residential energy over this period came from fossil fuels.
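A quick check of that share, using the two totals from the pivot table above:
In [ ]:
# Rough share of residential energy from fossil fuels over 1949-2012,
# computed from the pivot table totals cited above.
fossil_share = 408070 / 974079
print "Fossil fuel share of residential energy: %.1f%%" % (fossil_share * 100)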
In [15]:
# Total Energy Consumed by the Residential Sector (TERCBUS)
TERCBUS_yr = Res_Energy_Consumption_yr[Res_Energy_Consumption_yr.Description == 'Total Energy Consumed by the Residential Sector']
In [16]:
# Group by year and keep the record with the largest Value in each year.
def get_max_record(group):
    return group.ix[group.Value.idxmax()]

TERCBUSresult = TERCBUS_yr.groupby('YYYY').apply(get_max_record)
In [62]:
# Plot TERCBUS Values
TERCBUSresult.Value.plot()
# Making a title:
plt.title('Total Energy Consumed by the Residential Sector', weight='bold')
plt.ylabel('Trillion BTU')
plt.xlabel('Year')
plt.legend(['Total Energy'], loc ='best')
# The image can be saved as a pdf for later use:
plt.savefig('TERCBUSresult.pdf')
Finding: Total Energy Consumed by the Residential Sector generally increases over time, with some declines in recent years.
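One way to check those recent declines is the year-over-year percent change; a quick sketch:
In [ ]:
# Year-over-year percent change in total residential energy use;
# negative values in the last rows correspond to the recent declines.
TERCBUSresult.Value.pct_change().tail(10)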
In [18]:
#Describe the numerical data for Total Energy Consumed by the Residential Sector.
TERCBUSresult.Value.describe()
Out[18]:
Findings: The series covers 65 years. The average annual total energy usage is 14,986 trillion BTU.
In [19]:
# Create an integer for the maximum value.
s = int(max(TERCBUSresult.Value))
In [22]:
print "The maximum value (in trillion BTUs) is: %s " %s # insert value into a string with the %s placeholder
In [23]:
# Total Fossil Fuels Consumed by the Residential Sector = TFFCRS
# Use the yearly numbers:
TFFCRS_yr = Res_Energy_Consumption_yr[Res_Energy_Consumption_yr.Description == 'Total Fossil Fuels Consumed by the Residential Sector']
In [24]:
TFFCRS_yr.head(10)
Out[24]:
In [25]:
TFFCRS_yr.tail(10)
Out[25]:
In [26]:
# Save a local, cleaned copy so the analysis can start from here later.
TFFCRS_yr.to_csv("TFFCRS_yr.csv", index=False)
In [27]:
# Summary statistics for each year.
# A Series method.
TFFCRS_yr.groupby('YYYY')['Value'].describe()
Out[27]:
In [28]:
# Summary for each year
# a series method providing the mean.
TFFCRS_yr.groupby('YYYY')['Value'].mean()
Out[28]:
In [29]:
# Create a Function to get the maximum Value over the years.
def get_max_record(group):
return group.ix[group.Value.idxmax()]
TFFCRS_yr_max =TFFCRS_yr.groupby('Value').apply(get_max_record)
In [30]:
TFFCRS_yr_max.ix[TFFCRS_yr_max.Value.idxmax()]
Out[30]:
1972 had the highest fossil fuel usage.
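The same peak year can also be pulled directly from the yearly frame without the extra groupby; a minimal sketch:
In [ ]:
# Row with the single largest fossil fuel Value (should be 1972).
TFFCRS_yr.ix[TFFCRS_yr.Value.idxmax()][['YYYY', 'Value']]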
In [31]:
# Create a function to get the record with the maximum Value in each year.
# Revised version (_result).
def get_max_record_result(group):
    return group.ix[group.Value.idxmax()]

TFFCRS_yrresult = TFFCRS_yr.groupby('YYYY').apply(get_max_record_result)
In [32]:
# Plot TFFCRS Over Time
#TFFCRS_yr.Value.plot()
TFFCRS_yrresult.Value.plot(c ="green")
# Making a title:
plt.title('Total Fossil Fuels Consumed by the Residential Sector', weight='bold')
plt.ylabel('Trillion BTU')
plt.xlabel('Year')
plt.legend(['Fossil Fuels'])
Out[32]:
Usage grew substantially and then declined sharply around 1973.
That fits with the oil crisis:
http://en.wikipedia.org/wiki/1973_oil_crisis
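If the plot is redrawn, the 1973 oil crisis can be marked directly on it; a sketch using matplotlib's axvline:
In [ ]:
# Replot fossil fuel use and mark the 1973 oil crisis with a vertical line.
TFFCRS_yrresult.Value.plot(c="green")
plt.axvline(1973, color='red', linestyle='--')
plt.title('Total Fossil Fuels Consumed by the Residential Sector', weight='bold')
plt.ylabel('Trillion BTU')
plt.xlabel('Year')
plt.legend(['Fossil Fuels', '1973 oil crisis'], loc='best')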
In [33]:
# Plot TERCBUS Values over time
TERCBUSresult.Value.plot(c ="blue")
# Plot TFFCRS over time
#TFFCRS_yrresult.Value.plot()
TFFCRS_yrresult.Value.plot(c ="green")
# Making a title:
plt.title('Total Fossil Fuels (Green) and Total Energy Consumed (Blue) by the Residential Sector', weight='bold')
plt.ylabel('Trillion BTU')
plt.xlabel('Year')
plt.legend(['Total Energy','Fossil Fuels'], loc = 'best')
#plt.legend(['Total Energy'])
# The image can be saved as a pdf for later use:
plt.savefig('TFFCRS_TERCBUS.pdf')
# The image can be saved as a png file for web use:
plt.savefig('TFFCRS_TERCBUS.png')
Finding: While U.S. residential total energy use has continued to increase since 1970, fossil fuel use has leveled off.
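The widening gap between the two lines is easier to see as the non-fossil share of total residential energy; a rough sketch (both series are indexed by year after the groupby):
In [ ]:
# Fraction of residential energy NOT coming from fossil fuels, by year.
non_fossil_share = 1 - (TFFCRS_yrresult.Value / TERCBUSresult.Value)
non_fossil_share.plot()
plt.title('Non-Fossil Share of Residential Energy Consumption', weight='bold')
plt.ylabel('Share of total')
plt.xlabel('Year')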
In [34]:
# Playing with Regression: Fossil Fuels: TFFCRS by Year
# Though autocorrelation issues may exist.
# See https://www.udel.edu/htr/Statistics/Notes816/class20.PDF for a description
# Increase over time: Adj. R-squared: 0.042
# Statistically significant at the 90% confidence level, but not at 95% (p = 0.055).
# But Durbin-Watson: 0.105 suggests autocorrelation (well below 1 on a scale of 0 to 4).
model = smf.ols('Value ~ YYYY', data=TFFCRS_yr)
results = model.fit()
results.summary()
Out[34]:
Finding:
Adj. R-squared: 0.042 indicates a mild positive trend over time in residential fossil fuel consumption. The relationship is statistically significant at the 90% confidence level (p = 0.055). However, autocorrelation is an important issue in time-series analyses: the Durbin-Watson statistic (d = 0.105) points to strong positive autocorrelation, which means the apparent statistical significance is likely overstated.
As indicated in Wikipedia:
"Since d is approximately equal to 2(1 − r), where r is the sample autocorrelation of the residuals,[3] d = 2 indicates no autocorrelation. The value of d always lies between 0 and 4. If the Durbin–Watson statistic is substantially less than 2, there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin–Watson is less than 1.0, there may be cause for alarm. Small values of d indicate successive error terms are, on average, close in value to one another, or positively correlated. If d > 2, successive error terms are, on average, much different in value from one another, i.e., negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance."
Source: http://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic
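One hedged way to account for that autocorrelation, rather than trusting the plain OLS standard errors, is to refit the same model with Newey-West (HAC) standard errors; statsmodels supports this through fit()'s cov_type argument (the maxlags choice below is arbitrary, for illustration only):
In [ ]:
# Refit Value ~ YYYY with Newey-West (HAC) standard errors,
# which are robust to autocorrelation in the residuals.
# maxlags=3 is an arbitrary choice for illustration.
results_hac = smf.ols('Value ~ YYYY', data=TFFCRS_yr).fit(cov_type='HAC', cov_kwds={'maxlags': 3})
results_hac.summary()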
Look to this exercise (https://www.udel.edu/htr/Statistics/Notes816/class20.PDF) for an interesting analysis of this energy data, with consideration of the energy crises of the early 1970s. Consider building on that work.
The grouping documentation can be found here:
http://pandas.pydata.org/pandas-docs/dev/groupby.html.
Split, Apply, and Combine.
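As a small illustration of split-apply-combine with this data: split the yearly frame by Description, apply the mean to Value, and combine the results into one Series (a sketch):
In [ ]:
# Split by Description, apply the mean to Value, combine into one Series.
Res_Energy_Consumption_yr.groupby('Description')['Value'].mean()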
In [35]:
Res_Energy_Consumption_yr_desc = Res_Energy_Consumption_yr.groupby(['Description'])
print Res_Energy_Consumption_yr
In [63]:
# This is not working.
#grouped = Res_Energy_Consumption_yr_desc.groupby('MSN')
In [39]:
Res_Energy_Consumption_yr_desc
Out[39]:
In [64]:
# Not working
#for Value, group in Res_Energy_Consumption_yr:
# print(Description)
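What the commented-out loop above was probably reaching for: iterating over a GroupBy object yields (name, group) pairs, so the Description labels can be printed directly (a sketch using the grouped object created in In [35]):
In [ ]:
# Iterate over the groups created above; each iteration yields
# the Description label and the matching sub-DataFrame.
for description, group in Res_Energy_Consumption_yr_desc:
    print description, len(group)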
In [ ]:
In [68]:
s = round(41666/449736*100,2)
print "Overall, the percent renewable out of primary residential is: %s" %s
In [45]:
# Show the 2013 monthly values and the yearly summary row (MM = 13).
TFFCRS2013 = Res_Energy_Consumption_1949_2012[Res_Energy_Consumption_1949_2012.YYYY == 2013].sort_index(by='MM', ascending=True)
In [47]:
# Print the dataframe
TFFCRS2013
Out[47]:
In [49]:
# Bird's-eye, quick view of the 2013 data as a crosstab.
pd.crosstab(TFFCRS2013.Description, TFFCRS2013.Value)
# Need to bin this; there are too many distinct values (see the sketch below).
Out[49]:
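One way to make that crosstab readable is to bin Value first with pd.cut; a sketch with arbitrary bin edges (it assumes Value can be coerced to numeric):
In [ ]:
# Bin the 2013 values into a few ranges so the crosstab stays small.
numeric_value = TFFCRS2013.Value.replace("Not Available", np.nan).astype(float)
value_bins = pd.cut(numeric_value, bins=[0, 100, 500, 1000, 5000, 25000])
pd.crosstab(TFFCRS2013.Description, value_bins)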
In [50]:
# Count the yearly summary rows (MM = 13).
Res_Energy_Consumption_1949_2012.groupby(['MM']).size().ix[13]
# Returns a scalar from the MM-indexed size Series.
Out[50]:
In [51]:
# Groupby on two keys gives a hierarchically indexed size Series; select the 1980 slice.
Res_Energy_Consumption_1949_2012.groupby(['YYYY','Description']).size().ix[1980]
Out[51]:
In [52]:
# Get the Max Record by Value: Fossil Fuels -- Residential
# Using the yearly dataframe
def get_max_record_yr(group):
    return group.ix[group.Value.idxmax()]

get_max_record_yr(TFFCRS_yr)
Out[52]:
In [53]:
# Show the data types
TFFCRS_yr.dtypes
Out[53]:
In [54]:
# Convert the year to float.
TFFCRS_yr['YYYYfl'] = np.float32(TFFCRS_yr.YYYY)
# This raises a pandas warning; need to figure that out (see the sketch after Out[56]).
In [56]:
# Show the data types now with Year as a float.
TFFCRS_yr.dtypes
# It seems to have worked (with a warning)
# SEE: YYYYfl float32
# YYYY remains.
Out[56]:
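The warning is most likely pandas' SettingWithCopyWarning, since TFFCRS_yr is a slice of the original frame. A common fix is to take an explicit copy before adding columns; a minimal sketch:
In [ ]:
# Take an explicit copy of the slice, then add the float year column;
# this avoids the SettingWithCopyWarning.
TFFCRS_yr = TFFCRS_yr.copy()
TFFCRS_yr['YYYYfl'] = TFFCRS_yr.YYYY.astype(np.float32)
TFFCRS_yr.dtypes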
In [70]:
# not working
#NG_result.NaturalGas
In [71]:
# not working
#Res_Energy_Consumption_yr.NaturalGas
In [72]:
#Res_Energy_Consumption_yr.groupby(['Year','NaturalGas']).size().ix[2000]
In [ ]:
In [74]:
# Following: http://allendowney.blogspot.com/
# Regression of Value on year across all rows (all description types lumped together).
# TO DO: run this separately for each Description type, compare, and chart each comparison.
model = smf.ols('Value ~ YYYY', data=Res_Energy_Consumption_1949_2012)
results = model.fit()
results.summary()
Out[74]:
Finding: Still a low Durbin-Watson. Look for a more interesting and practical regression analysis.
In [75]:
# plot this later
#result.TOTAL.plot()
In [78]:
# Merge in some other data and try to analyze it.
# Such as economic data.
# Perhaps do a state by state analysis of energy use and economic growth.
# TestData2 = read_csv('TestData2.csv')
In [79]:
# pandas join function is merge
# merge?
# merge(Res_Primary_E_Consumption, TestData2, on= [ 'Year'])
# simple join on year
# inner join is default
In [51]:
#merged = merge(Res_Primary_E_Consumption, TestData2, on= [ 'Year'])
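A minimal sketch of what that merge might look like, using a tiny made-up economic table (TestData2 and its GDP column are hypothetical, for illustration only):
In [ ]:
# Hypothetical example of joining yearly energy data with a second table on Year.
# TestData2 and its GDP values are made up for illustration only.
TestData2 = pd.DataFrame({'Year': [2010, 2011, 2012],
                          'GDP': [14.96, 15.52, 16.16]})
energy_by_year = TERCBUSresult[['YYYY', 'Value']].rename(columns={'YYYY': 'Year'})
merged = pd.merge(energy_by_year, TestData2, on='Year')  # inner join by default
merged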
In [80]:
#from __future__ import division
#if needed
In [93]:
#merged['NG'] = np.float32(merged.NaturalGas)
In [95]:
#merged['TTL'] = np.float32(merged.TOTAL)
In [101]:
#merged['RNW'] = np.float32(merged.Renewable)
In [98]:
# Create a new proportion variable
#merged['NG_percent'] = (merged.NG / merged.TTL)
In [81]:
# Total test: create a new column called TOTALtest
# (would need to add the number of persons).
#merged['TOTALtest'] = np.floor(merged.TOTAL * merged.Col2)
# Decimal numbers, so use np.floor.
In [82]:
#merged.groupby(['Year', 'TOTALtest'])['Col3'].sum()
# For each Year/TOTALtest combination: the total of Col3.
# Hierarchical labeling.
In [85]:
# Regression Analyses to come. For review
'''
import statsmodels.formula.api as smf
model = smf.ols('merged.TOTAL ~ merged.Col2', data=merged)
results_merged = model.fit()
results_merged.summary()
'''
Out[85]:
That is all I have for now.
In [ ]: