Coal mining data from eia.gov
Combine and clean the raw CSV files into a single cleaned data set and a coherent database.
It is generally a good idea to keep the raw data in a separate data folder.
When you clean the raw data, leave the raw files in place and create a cleaned version with the cleaning steps recorded (an ideal use for a notebook).
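One way to keep raw and cleaned files apart is to derive the cleaned file name from the raw one instead of overwriting it. This is a minimal sketch; the helper name `cleaned_path` and the `_cleaned` suffix are hypothetical conventions, not something this notebook defines.

```python
from pathlib import Path


def cleaned_path(raw_path, suffix="_cleaned"):
    """Return a sibling path for the cleaned copy; the raw file is never touched."""
    raw_path = Path(raw_path)
    # Insert the suffix between the file stem and its extension.
    return raw_path.with_name(raw_path.stem + suffix + raw_path.suffix)
```

Called on a raw file path, this returns a path in the same folder with `_cleaned` added before the extension, so the raw file stays in place.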
In [3]:
import numpy as np
import pandas as pd
In [4]:
!pwd
In [5]:
# The cleaned data file is saved here:
output_file = "../data/coal_prod_cleaned.csv"
In [6]:
# One data frame per production year, named after the year it holds
df2008 = pd.read_csv("../data/coal_prod_2008.csv", index_col="MSHA_ID")
df2009 = pd.read_csv("../data/coal_prod_2009.csv", index_col="MSHA_ID")
df2010 = pd.read_csv("../data/coal_prod_2010.csv", index_col="MSHA_ID")
df2011 = pd.read_csv("../data/coal_prod_2011.csv", index_col="MSHA_ID")
df2012 = pd.read_csv("../data/coal_prod_2012.csv", index_col="MSHA_ID")
In [7]:
# Stack the yearly frames into one data frame
dframe = pd.concat((df2008, df2009, df2010, df2011, df2012))
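The five reads and the concatenation above could also be written as a loop. This is a minimal sketch that assumes the same file naming pattern; the helper name `load_years` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd


def load_years(years, pattern="../data/coal_prod_{year}.csv", index_col="MSHA_ID"):
    """Read one CSV per year and stack the results into a single DataFrame."""
    frames = [
        pd.read_csv(pattern.format(year=year), index_col=index_col)
        for year in years
    ]
    return pd.concat(frames)


# Possible usage for the 2008-2012 files read above:
# dframe = load_years(range(2008, 2013))
```

If the same MSHA_ID appears in more than one yearly file, the index of the combined frame will contain repeated values, because `pd.concat` keeps each frame's index as it is.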
In [8]:
# Noticed a probable typo in the data set:
dframe['Company_Type'].unique()
Out[8]:
In [9]:
# Correcting the Company_Type
dframe.loc[dframe['Company_Type'] == 'Indepedent Producer Operator', 'Company_Type'] = 'Independent Producer Operator'
dframe.head()
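The same correction can be expressed as a mapping, which scales better if more misspellings turn up. This is a sketch on a toy frame; the `toy` data and the `fixes` dictionary are illustrative assumptions, and only the misspelled and corrected labels come from the cell above.

```python
import pandas as pd

# Toy frame standing in for dframe; the rows are illustrative only.
toy = pd.DataFrame({
    "Company_Type": [
        "Indepedent Producer Operator",
        "Independent Producer Operator",
        "Operating Subsidiary",
    ]
})

# Map each misspelled label to its corrected form.
fixes = {"Indepedent Producer Operator": "Independent Producer Operator"}
toy["Company_Type"] = toy["Company_Type"].replace(fixes)

# Any label still listed in `fixes` would mean the correction did not apply.
remaining = set(toy["Company_Type"].unique()) & set(fixes)
```

Re-running `unique()` after the fix, as in the earlier cell, is a quick way to confirm that the misspelled label is gone.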
Out[9]:
In [10]:
dframe[dframe.Year == 2008].head()
Out[10]:
In [11]:
dframe.to_csv(output_file)
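After saving, it can be worth reading the file back to confirm that nothing was lost on the way to disk. This is a minimal sketch; the helper name `matches_saved_file` is hypothetical, and it compares only the shape and the column names, not every value.

```python
import pandas as pd


def matches_saved_file(frame, path, index_col="MSHA_ID"):
    """Reload a saved CSV and compare its shape and columns with the frame."""
    reloaded = pd.read_csv(path, index_col=index_col)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns


# Possible usage with the names from this notebook:
# matches_saved_file(dframe, output_file)
```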