Clean the raw data

Data from U.S. Energy Information Administration

Combining and cleaning the raw csv files into a cleaned data set and coherent database.

Generally a good idea to have a separate data folder with the raw data.

When you clean the raw data, leave the raw in place, and create cleaned version with the steps included (ideal situation for Notebook).



In [1]:

    
# %install_ext http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information
%version_information numpy, scipy, matplotlib, pandas









    Out[1]:




Software Version
Python 2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython 3.2.1
OS Darwin 14.4.0 x86_64 i386 64bit
numpy 1.9.2
scipy 0.15.1
matplotlib 1.4.3
pandas 0.16.2
Thu Jul 16 11:42:02 2015 PDT



In [2]:

    
import numpy as np
import pandas as pd



In [3]:

    
!pwd









    



/Users/jonathan/Desktop/OSCON/OSCON-2015/deliver



In [4]:

    
# The cleaned data file is saved here:
output_file = "../data/coal_prod_cleaned.csv"



In [5]:

    
df1 = pd.read_csv("../data/coal_prod_2002.csv", index_col="MSHA_ID")
df2 = pd.read_csv("../data/coal_prod_2003.csv", index_col="MSHA_ID")
df3 = pd.read_csv("../data/coal_prod_2004.csv", index_col="MSHA_ID")
df4 = pd.read_csv("../data/coal_prod_2005.csv", index_col="MSHA_ID")
df5 = pd.read_csv("../data/coal_prod_2006.csv", index_col="MSHA_ID")
df6 = pd.read_csv("../data/coal_prod_2007.csv", index_col="MSHA_ID")
df7 = pd.read_csv("../data/coal_prod_2008.csv", index_col="MSHA_ID")
df8 = pd.read_csv("../data/coal_prod_2009.csv", index_col="MSHA_ID")
df9 = pd.read_csv("../data/coal_prod_2010.csv", index_col="MSHA_ID")
df10 = pd.read_csv("../data/coal_prod_2011.csv", index_col="MSHA_ID")
df11 = pd.read_csv("../data/coal_prod_2012.csv", index_col="MSHA_ID")



In [6]:

    
dframe = pd.concat((df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11))



In [7]:

    
# Noticed a probable typo in the data set: 
dframe['Company_Type'].unique()









    Out[7]:





array(['Indepedent Producer Operator', 'Operating Subsidiary', 'Contractor'], dtype=object)



In [8]:

    
# Correcting the Company_Type
dframe.loc[dframe['Company_Type'] == 'Indepedent Producer Operator', 'Company_Type'] = 'Independent Producer Operator'
dframe.head()









    Out[8]:






  
    
      
      Average_Employees
      Company_Type
      Labor_Hours
      Mine_Basin
      Mine_County
      Mine_Name
      Mine_State
      Mine_Status
      Mine_Type
      Operating_Company
      Operating_Company_Address
      Operation_Type
      Production_short_tons
      Union_Code
      Year
    
    
      MSHA_ID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      102838
      4
      Independent Producer Operator
      2712
      Appalachia Southern
      Bibb
      Hebron Mine
      Alabama
      Permanently abandoned
      Surface
      Birmingham Coal & Coke Company
      2477 Valleydale Rd. S. B3, Birmingham, AL 35244
      Mine only
      10572
      NaN
      2002
    
    
      103184
      5
      Independent Producer Operator
      2480
      Appalachia Southern
      Fayette
      Berry Mine
      Alabama
      Temporarily closed
      Surface
      Midas Coal Company Incorporate
      401 10th Avenue, S. E, Cullman, AL 35055
      Mine only
      9725
      NaN
      2002
    
    
      100329
      55
      Operating Subsidiary
      123618
      Appalachia Southern
      Jefferson
      Concord Mine
      Alabama
      Active
      Underground
      U S Steel Mining Company Llc
      8800 Oak Grove Mine Road, Adger, AL 35006
      Preparation Plant
      0
      United Mine Workers of America
      2002
    
    
      100851
      331
      Operating Subsidiary
      748182
      Appalachia Southern
      Jefferson
      Oak Grove Mine
      Alabama
      Active
      Underground
      U S Steel Mining Company Llc
      8800 Oak Grove Mine Rd, Adger, AL 35006
      Mine only
      1942153
      United Mine Workers of America
      2002
    
    
      102354
      28
      Independent Producer Operator
      55306
      Appalachia Southern
      Jefferson
      Lindbergh
      Alabama
      Active
      Surface
      C & H Mining Company Inc
      P.O. Box 70250, Tuscaloosa, AL 35407
      Mine only
      168446
      NaN
      2002



In [9]:

    
dframe[dframe.Year == 2003].head()









    Out[9]:






  
    
      
      Average_Employees
      Company_Type
      Labor_Hours
      Mine_Basin
      Mine_County
      Mine_Name
      Mine_State
      Mine_Status
      Mine_Type
      Operating_Company
      Operating_Company_Address
      Operation_Type
      Production_short_tons
      Union_Code
      Year
    
    
      MSHA_ID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      103117
      50
      Independent Producer Operator
      67199
      Appalachia Southern
      Cullman
      Mine #2
      Alabama
      Active
      Surface
      Twin Pines Coal Company Inc
      1874 County Road 15, Bremen, AL 35503
      Mine only
      177381
      NaN
      2003
    
    
      103246
      4
      Independent Producer Operator
      11075
      Appalachia Southern
      Franklin
      Bear Creek
      Alabama
      Active
      Surface
      Birmingham Coal & Coke Co., In
      P.O. Box 354, Lynn, AL 35575
      Mine only
      46049
      NaN
      2003
    
    
      103006
      3
      Independent Producer Operator
      5161
      Appalachia Southern
      Jackson
      Bledsoe Mine No 1
      Alabama
      Mine closed by MSHA
      Underground
      A L Select Inc
      P.O. Box 864, Stevenson, AL 35772
      Mine only
      500
      NaN
      2003
    
    
      103183
      8
      Independent Producer Operator
      19348
      Appalachia Southern
      Jackson
      Henager
      Alabama
      Active
      Surface
      Gtm Mining Corporation
      15693 Alabama Highway 71, Pisgah, AL 35765
      Mine only
      55187
      NaN
      2003
    
    
      100329
      23
      Operating Subsidiary
      52009
      Appalachia Southern
      Jefferson
      Concord Mine
      Alabama
      Active
      Surface
      Oak Grove Resources, Llc
      8800 Oak Grove Mine Road, Adger, AL 35006
      Preparation Plant
      0
      United Mine Workers of America
      2003

Final Cleaned Data Product



In [10]:

    
dframe.to_csv(output_file, )



In [ ]:

Software	Version
Python	2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython	3.2.1
OS	Darwin 14.4.0 x86_64 i386 64bit
numpy	1.9.2
scipy	0.15.1
matplotlib	1.4.3
pandas	0.16.2
Thu Jul 16 11:42:02 2015 PDT

	Average_Employees	Company_Type	Labor_Hours	Mine_Basin	Mine_County	Mine_Name	Mine_State	Mine_Status	Mine_Type	Operating_Company	Operating_Company_Address	Operation_Type	Production_short_tons	Union_Code	Year
MSHA_ID
102838	4	Independent Producer Operator	2712	Appalachia Southern	Bibb	Hebron Mine	Alabama	Permanently abandoned	Surface	Birmingham Coal & Coke Company	2477 Valleydale Rd. S. B3, Birmingham, AL 35244	Mine only	10572	NaN	2002
103184	5	Independent Producer Operator	2480	Appalachia Southern	Fayette	Berry Mine	Alabama	Temporarily closed	Surface	Midas Coal Company Incorporate	401 10th Avenue, S. E, Cullman, AL 35055	Mine only	9725	NaN	2002
100329	55	Operating Subsidiary	123618	Appalachia Southern	Jefferson	Concord Mine	Alabama	Active	Underground	U S Steel Mining Company Llc	8800 Oak Grove Mine Road, Adger, AL 35006	Preparation Plant	0	United Mine Workers of America	2002
100851	331	Operating Subsidiary	748182	Appalachia Southern	Jefferson	Oak Grove Mine	Alabama	Active	Underground	U S Steel Mining Company Llc	8800 Oak Grove Mine Rd, Adger, AL 35006	Mine only	1942153	United Mine Workers of America	2002
102354	28	Independent Producer Operator	55306	Appalachia Southern	Jefferson	Lindbergh	Alabama	Active	Surface	C & H Mining Company Inc	P.O. Box 70250, Tuscaloosa, AL 35407	Mine only	168446	NaN	2002

	Average_Employees	Company_Type	Labor_Hours	Mine_Basin	Mine_County	Mine_Name	Mine_State	Mine_Status	Mine_Type	Operating_Company	Operating_Company_Address	Operation_Type	Production_short_tons	Union_Code	Year
MSHA_ID
103117	50	Independent Producer Operator	67199	Appalachia Southern	Cullman	Mine #2	Alabama	Active	Surface	Twin Pines Coal Company Inc	1874 County Road 15, Bremen, AL 35503	Mine only	177381	NaN	2003
103246	4	Independent Producer Operator	11075	Appalachia Southern	Franklin	Bear Creek	Alabama	Active	Surface	Birmingham Coal & Coke Co., In	P.O. Box 354, Lynn, AL 35575	Mine only	46049	NaN	2003
103006	3	Independent Producer Operator	5161	Appalachia Southern	Jackson	Bledsoe Mine No 1	Alabama	Mine closed by MSHA	Underground	A L Select Inc	P.O. Box 864, Stevenson, AL 35772	Mine only	500	NaN	2003
103183	8	Independent Producer Operator	19348	Appalachia Southern	Jackson	Henager	Alabama	Active	Surface	Gtm Mining Corporation	15693 Alabama Highway 71, Pisgah, AL 35765	Mine only	55187	NaN	2003
100329	23	Operating Subsidiary	52009	Appalachia Southern	Jefferson	Concord Mine	Alabama	Active	Surface	Oak Grove Resources, Llc	8800 Oak Grove Mine Road, Adger, AL 35006	Preparation Plant	0	United Mine Workers of America	2003