Clean the raw data

Data from U.S. Energy Information Administration

Data URL: eia.gov

Combining and cleaning the raw csv files into a cleaned data set and coherent database.

Generally a good idea to have a separate data folder with the raw data.

When you clean the raw data, leave the raw in place, and create cleaned version with the steps included (ideal situation for Notebook).


In [1]:
# %install_ext http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information
%version_information numpy, scipy, matplotlib, pandas


Out[1]:
SoftwareVersion
Python2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython3.2.1
OSDarwin 14.4.0 x86_64 i386 64bit
numpy1.9.2
scipy0.15.1
matplotlib1.4.3
pandas0.16.2
Thu Jul 16 11:42:02 2015 PDT

In [2]:
import numpy as np
import pandas as pd

In [3]:
!pwd


/Users/jonathan/Desktop/OSCON/OSCON-2015/deliver

In [4]:
# The cleaned data file is saved here:
output_file = "../data/coal_prod_cleaned.csv"

In [5]:
df1 = pd.read_csv("../data/coal_prod_2002.csv", index_col="MSHA_ID")
df2 = pd.read_csv("../data/coal_prod_2003.csv", index_col="MSHA_ID")
df3 = pd.read_csv("../data/coal_prod_2004.csv", index_col="MSHA_ID")
df4 = pd.read_csv("../data/coal_prod_2005.csv", index_col="MSHA_ID")
df5 = pd.read_csv("../data/coal_prod_2006.csv", index_col="MSHA_ID")
df6 = pd.read_csv("../data/coal_prod_2007.csv", index_col="MSHA_ID")
df7 = pd.read_csv("../data/coal_prod_2008.csv", index_col="MSHA_ID")
df8 = pd.read_csv("../data/coal_prod_2009.csv", index_col="MSHA_ID")
df9 = pd.read_csv("../data/coal_prod_2010.csv", index_col="MSHA_ID")
df10 = pd.read_csv("../data/coal_prod_2011.csv", index_col="MSHA_ID")
df11 = pd.read_csv("../data/coal_prod_2012.csv", index_col="MSHA_ID")

In [6]:
dframe = pd.concat((df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11))

In [7]:
# Noticed a probable typo in the data set: 
dframe['Company_Type'].unique()


Out[7]:
array(['Indepedent Producer Operator', 'Operating Subsidiary', 'Contractor'], dtype=object)

In [8]:
# Correcting the Company_Type
dframe.loc[dframe['Company_Type'] == 'Indepedent Producer Operator', 'Company_Type'] = 'Independent Producer Operator'
dframe.head()


Out[8]:
Average_Employees Company_Type Labor_Hours Mine_Basin Mine_County Mine_Name Mine_State Mine_Status Mine_Type Operating_Company Operating_Company_Address Operation_Type Production_short_tons Union_Code Year
MSHA_ID
102838 4 Independent Producer Operator 2712 Appalachia Southern Bibb Hebron Mine Alabama Permanently abandoned Surface Birmingham Coal & Coke Company 2477 Valleydale Rd. S. B3, Birmingham, AL 35244 Mine only 10572 NaN 2002
103184 5 Independent Producer Operator 2480 Appalachia Southern Fayette Berry Mine Alabama Temporarily closed Surface Midas Coal Company Incorporate 401 10th Avenue, S. E, Cullman, AL 35055 Mine only 9725 NaN 2002
100329 55 Operating Subsidiary 123618 Appalachia Southern Jefferson Concord Mine Alabama Active Underground U S Steel Mining Company Llc 8800 Oak Grove Mine Road, Adger, AL 35006 Preparation Plant 0 United Mine Workers of America 2002
100851 331 Operating Subsidiary 748182 Appalachia Southern Jefferson Oak Grove Mine Alabama Active Underground U S Steel Mining Company Llc 8800 Oak Grove Mine Rd, Adger, AL 35006 Mine only 1942153 United Mine Workers of America 2002
102354 28 Independent Producer Operator 55306 Appalachia Southern Jefferson Lindbergh Alabama Active Surface C & H Mining Company Inc P.O. Box 70250, Tuscaloosa, AL 35407 Mine only 168446 NaN 2002

In [9]:
dframe[dframe.Year == 2003].head()


Out[9]:
Average_Employees Company_Type Labor_Hours Mine_Basin Mine_County Mine_Name Mine_State Mine_Status Mine_Type Operating_Company Operating_Company_Address Operation_Type Production_short_tons Union_Code Year
MSHA_ID
103117 50 Independent Producer Operator 67199 Appalachia Southern Cullman Mine #2 Alabama Active Surface Twin Pines Coal Company Inc 1874 County Road 15, Bremen, AL 35503 Mine only 177381 NaN 2003
103246 4 Independent Producer Operator 11075 Appalachia Southern Franklin Bear Creek Alabama Active Surface Birmingham Coal & Coke Co., In P.O. Box 354, Lynn, AL 35575 Mine only 46049 NaN 2003
103006 3 Independent Producer Operator 5161 Appalachia Southern Jackson Bledsoe Mine No 1 Alabama Mine closed by MSHA Underground A L Select Inc P.O. Box 864, Stevenson, AL 35772 Mine only 500 NaN 2003
103183 8 Independent Producer Operator 19348 Appalachia Southern Jackson Henager Alabama Active Surface Gtm Mining Corporation 15693 Alabama Highway 71, Pisgah, AL 35765 Mine only 55187 NaN 2003
100329 23 Operating Subsidiary 52009 Appalachia Southern Jefferson Concord Mine Alabama Active Surface Oak Grove Resources, Llc 8800 Oak Grove Mine Road, Adger, AL 35006 Preparation Plant 0 United Mine Workers of America 2003

Final Cleaned Data Product


In [10]:
dframe.to_csv(output_file, )

In [ ]: