Clean data

Coal mining data from eia.gov

Combining and cleaning the raw csv files into a cleaned data set and coherent database.

Generally a good idea to have a separate data folder with the raw data.

When you clean the raw data, leave the raw in place, and create cleaned version with the steps included (ideal situation for Notebook).



In [3]:

    
import numpy as np
import pandas as pd



In [4]:

    
!pwd









    



/Users/jonathan/github/jupyter-tips-and-tricks/deliver



In [5]:

    
# The cleaned data file is saved here:
output_file = "../data/coal_prod_cleaned.csv"



In [6]:

    
df7 = pd.read_csv("../data/coal_prod_2008.csv", index_col="MSHA_ID")
df8 = pd.read_csv("../data/coal_prod_2009.csv", index_col="MSHA_ID")
df9 = pd.read_csv("../data/coal_prod_2010.csv", index_col="MSHA_ID")
df10 = pd.read_csv("../data/coal_prod_2011.csv", index_col="MSHA_ID")
df11 = pd.read_csv("../data/coal_prod_2012.csv", index_col="MSHA_ID")



In [7]:

    
dframe = pd.concat((df7, df8, df9, df10, df11))



In [8]:

    
# Noticed a probable typo in the data set: 
dframe['Company_Type'].unique()









    Out[8]:





array(['Indepedent Producer Operator', 'Operating Subsidiary', 'Contractor'], dtype=object)



In [9]:

    
# Correcting the Company_Type
dframe.loc[dframe['Company_Type'] == 'Indepedent Producer Operator', 'Company_Type'] = 'Independent Producer Operator'
dframe.head()









    Out[9]:






  
    
      
      Average_Employees
      Company_Type
      Labor_Hours
      Mine_Basin
      Mine_County
      Mine_Name
      Mine_State
      Mine_Status
      Mine_Type
      Operating_Company
      Operating_Company_Address
      Operation_Type
      Production_short_tons
      Union_Code
      Year
    
    
      MSHA_ID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      103295
      18.0
      Independent Producer Operator
      39175.0
      Appalachia Southern
      Bibb
      Seymour Mine
      Alabama
      Active
      Surface
      Hope Coal Company Inc
      P.O. Box 249, Maylene, AL 35114
      Mine only
      105082.0
      NaN
      2008
    
    
      103117
      19.0
      Operating Subsidiary
      29926.0
      Appalachia Southern
      Cullman
      Mine #2, #3, #4
      Alabama
      Active, men working, not producing
      Surface
      Twin Pines Coal Company Inc
      1874 County Road 15, Bremen, AL 35033
      Mine only
      10419.0
      NaN
      2008
    
    
      103361
      20.0
      Operating Subsidiary
      42542.0
      Appalachia Southern
      Cullman
      Cold Springs West Mine
      Alabama
      Active
      Surface
      Twin Pines Coal Company
      74 Industrial Parkway, Jasper, AL 35502
      Mine only
      143208.0
      NaN
      2008
    
    
      100759
      395.0
      Operating Subsidiary
      890710.0
      Appalachia Southern
      Fayette
      North River # 1 Underground Mi
      Alabama
      Active
      Underground
      Chevron Mining Inc
      3114 County Road 63 S, Berry, AL 35546
      Mine and Preparation Plant
      2923261.0
      United Mine Workers of America
      2008
    
    
      103246
      22.0
      Independent Producer Operator
      55403.0
      Appalachia Southern
      Franklin
      Bear Creek
      Alabama
      Active
      Surface
      Birmingham Coal & Coke Co., In
      912 Edenton Street, Birmingham, AL 35242
      Mine only
      183137.0
      NaN
      2008



In [10]:

    
dframe[dframe.Year == 2008].head()









    Out[10]:






  
    
      
      Average_Employees
      Company_Type
      Labor_Hours
      Mine_Basin
      Mine_County
      Mine_Name
      Mine_State
      Mine_Status
      Mine_Type
      Operating_Company
      Operating_Company_Address
      Operation_Type
      Production_short_tons
      Union_Code
      Year
    
    
      MSHA_ID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      103295
      18.0
      Independent Producer Operator
      39175.0
      Appalachia Southern
      Bibb
      Seymour Mine
      Alabama
      Active
      Surface
      Hope Coal Company Inc
      P.O. Box 249, Maylene, AL 35114
      Mine only
      105082.0
      NaN
      2008
    
    
      103117
      19.0
      Operating Subsidiary
      29926.0
      Appalachia Southern
      Cullman
      Mine #2, #3, #4
      Alabama
      Active, men working, not producing
      Surface
      Twin Pines Coal Company Inc
      1874 County Road 15, Bremen, AL 35033
      Mine only
      10419.0
      NaN
      2008
    
    
      103361
      20.0
      Operating Subsidiary
      42542.0
      Appalachia Southern
      Cullman
      Cold Springs West Mine
      Alabama
      Active
      Surface
      Twin Pines Coal Company
      74 Industrial Parkway, Jasper, AL 35502
      Mine only
      143208.0
      NaN
      2008
    
    
      100759
      395.0
      Operating Subsidiary
      890710.0
      Appalachia Southern
      Fayette
      North River # 1 Underground Mi
      Alabama
      Active
      Underground
      Chevron Mining Inc
      3114 County Road 63 S, Berry, AL 35546
      Mine and Preparation Plant
      2923261.0
      United Mine Workers of America
      2008
    
    
      103246
      22.0
      Independent Producer Operator
      55403.0
      Appalachia Southern
      Franklin
      Bear Creek
      Alabama
      Active
      Surface
      Birmingham Coal & Coke Co., In
      912 Edenton Street, Birmingham, AL 35242
      Mine only
      183137.0
      NaN
      2008

Final Cleaned Data Product



In [11]:

    
dframe.to_csv(output_file, )



In [ ]:

	Average_Employees	Company_Type	Labor_Hours	Mine_Basin	Mine_County	Mine_Name	Mine_State	Mine_Status	Mine_Type	Operating_Company	Operating_Company_Address	Operation_Type	Production_short_tons	Union_Code	Year
MSHA_ID
103295	18.0	Independent Producer Operator	39175.0	Appalachia Southern	Bibb	Seymour Mine	Alabama	Active	Surface	Hope Coal Company Inc	P.O. Box 249, Maylene, AL 35114	Mine only	105082.0	NaN	2008
103117	19.0	Operating Subsidiary	29926.0	Appalachia Southern	Cullman	Mine #2, #3, #4	Alabama	Active, men working, not producing	Surface	Twin Pines Coal Company Inc	1874 County Road 15, Bremen, AL 35033	Mine only	10419.0	NaN	2008
103361	20.0	Operating Subsidiary	42542.0	Appalachia Southern	Cullman	Cold Springs West Mine	Alabama	Active	Surface	Twin Pines Coal Company	74 Industrial Parkway, Jasper, AL 35502	Mine only	143208.0	NaN	2008
100759	395.0	Operating Subsidiary	890710.0	Appalachia Southern	Fayette	North River # 1 Underground Mi	Alabama	Active	Underground	Chevron Mining Inc	3114 County Road 63 S, Berry, AL 35546	Mine and Preparation Plant	2923261.0	United Mine Workers of America	2008
103246	22.0	Independent Producer Operator	55403.0	Appalachia Southern	Franklin	Bear Creek	Alabama	Active	Surface	Birmingham Coal & Coke Co., In	912 Edenton Street, Birmingham, AL 35242	Mine only	183137.0	NaN	2008