The primary data sources are the OSHA website (http://www.osha.gov) and the Department of Labor Enforcement Data Catalog (http://ogesdw.dol.gov/).
In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression
from scipy import stats
%matplotlib inline
What does the dataset look like?
In [2]:
oshadata = pd.read_csv("../data/incidents_by_state.csv", low_memory=False)
oshadata[:5]
Out[2]:
Yeah, but what does it LOOK like? Let's check out a visualization of these incidents as a time series.
In [3]:
df = pd.read_csv("../data/incident_totals_year.csv", index_col=0)
df.osha_incidents.plot(color='r',lw=1.3)
plt.title('Workplace Fatalities and Catastrophes by Year')
plt.xlabel("1983-2012")
plt.ylabel("Fatal/Catastrophic Incidents")
plt.show()
It turns out that, because of irregularities in the way these reports were logged before 1990, we should ignore what appears to be a trend in volume from 1983-1990. We know from the Bureau of Labor Statistics that there were actually more like 10,000 fatalities per year during that period. The graph also suggests that fatal events trail off after 2009, but it turns out that this is due to data maturation issues: BLS reports show that the number of fatal workplace incidents has held fairly steady at around 4,500 per year since 2010. So let's draw a histogram of event frequency by year, focusing on 1990-2010.
In [4]:
subset = df.loc[1990:2010]
subset.plot(kind='bar', stacked=False)
plt.title("Frequency of fatalities and catastrophic events by year")
plt.xlabel("1990-2010")
plt.ylabel("Frequency")
plt.show()
In [5]:
import pylab as pl
import numpy as np
datafile = "../data/osha_month.csv"
csvfile = open(datafile)
Let's start by grouping the data into seasons.
In [6]:
month_list = []
fatality_list = []
for each_line in csvfile:
    list_row = each_line.split(',')
    if list_row[0] == 'Month':  # skip the header row
        continue
    month_list.append(list_row[0])
    fatality_list.append(list_row[1])
csvfile.close()

# Tally incidents by month and by season
month_dict = {}
season_dict = {}
for month in month_list:
    month_dict[month] = month_dict.get(month, 0) + 1
    if month in ("12", "1", "2"):
        season = "Winter"
    elif month in ("3", "4", "5"):
        season = "Spring"
    elif month in ("6", "7", "8"):
        season = "Summer"
    else:
        season = "Fall"
    season_dict[season] = season_dict.get(season, 0) + 1

# Convert the raw counts to percentages of all incidents
month_pct_dict = {}
season_pct_dict = {}
number_of_month_codes = len(month_list)
for month in month_dict:
    month_pct_dict[month] = float(month_dict[month]) / number_of_month_codes * 100
for season in season_dict:
    season_pct_dict[season] = float(season_dict[season]) / number_of_month_codes * 100
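As a cross-check on the hand-rolled tallies, the same season percentages can be computed in a few lines of pandas. A minimal sketch, assuming read_csv parses the Month column of osha_month.csv as the integers 1-12:
In [ ]:
# Pandas cross-check of the season tallies above. Assumes the 'Month'
# column loads as integers 1-12 (if it loads as strings, the keys in
# season_map would need to be strings too).
months = pd.read_csv("../data/osha_month.csv")["Month"]
season_map = {12: "Winter", 1: "Winter", 2: "Winter",
              3: "Spring", 4: "Spring", 5: "Spring",
              6: "Summer", 7: "Summer", 8: "Summer",
              9: "Fall", 10: "Fall", 11: "Fall"}
season_pcts = months.map(season_map).value_counts() / float(len(months)) * 100
print(season_pcts)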
In [7]:
print 'Month Details'
print '-------------'
for month in sorted(month_pct_dict, key=month_pct_dict.get, reverse=True):
    print month, month_dict[month], month_pct_dict[month]
In [8]:
print 'Season Details'
print '--------------'
chart_dict = {}
for season in sorted(season_pct_dict, key=season_pct_dict.get, reverse=True):
    chart_dict[season] = season_pct_dict[season]
    print season, season_dict[season], season_pct_dict[season]
In [9]:
X = np.arange( len( chart_dict ) )
pl.bar(X, chart_dict.values(), align='center', width=0.5)
pl.xticks(X, chart_dict.keys())
ymax = max( chart_dict.values() ) + 1
pl.ylim(0, ymax)
pl.show()
In [10]:
import pylab as pl
import numpy as np
datafile = "../data/osha_data.csv"
csvfile = open(datafile)
Break out industries by Standard Industrial Classification (SIC) code.
In [11]:
sic_list = []
fatality_list = []
for each_line in csvfile:
    list_row = each_line.split(',')
    if list_row[0] == 'SIC':  # skip the header row
        continue
    sic_list.append(list_row[0])
    fatality_list.append(list_row[1])
csvfile.close()

# Tally incidents by SIC code, then convert to percentages
sic_dict = {}
for sic in sic_list:
    sic_dict[sic] = sic_dict.get(sic, 0) + 1

sic_pct_dict = {}
number_of_sic_codes = len(sic_list)
for sic in sic_dict:
    sic_pct_dict[sic] = float(sic_dict[sic]) / number_of_sic_codes * 100
Print out the top 25 SIC codes.
In [12]:
chart_dict = {}
for sic in sorted(sic_pct_dict, key=sic_pct_dict.get, reverse=True)[:25]:
    chart_dict[sic] = sic_pct_dict[sic]
    print sic, sic_dict[sic], sic_pct_dict[sic]
In [13]:
X = np.arange( len( chart_dict ) )
pl.bar(X, chart_dict.values(), align='center', width=0.5)
pl.xticks(X, chart_dict.keys(), rotation=90)  # 25 labels: rotate to keep them legible
ymax = max( chart_dict.values() ) + 1
pl.ylim(0, ymax)
pl.show()
In [101]:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
from pylab import pcolor, show, colorbar, xticks, yticks
In [102]:
year=['1984','1985','1986','1987','1988','1989','1990','1991','1992','1993','1994','1995',
'1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007',
'2008','2009','2010','2011','2012','2013','2014']
rates_array = []
In [103]:
def unemployment_data():
    df = pd.read_csv("../data/unemployment_data.csv")
    # average each year's monthly figures into a single annual rate
    df_rate = pd.DataFrame(data=df.mean(), columns=['RATE'])
    for val in df_rate.values:
        rates_array.append(float(val))
    return df_rate
In [104]:
def construction_starts():
    # read_csv already returns a DataFrame
    return pd.read_csv("../data/construction_starts.csv")
In [105]:
def unemployment_construction_correlation():
    rates = unemployment_data()
    starts = construction_starts()
    # Pearson correlation coefficient between the two series
    print stats.pearsonr(starts["Total"], rates_array)
    # time-series plot: unemployment and construction starts on twin axes
    ax = plt.gca()
    ax.set_xlabel('Year', fontsize=14, color='b')
    ax.set_ylabel('Unemployment', fontsize=11, color='b')
    ax.plot(year, rates, color='red')
    ax2 = ax.twinx()
    ax2.set_ylabel('Construction Starts', fontsize=11, color='b')
    ax2.plot(year, starts["Total"], color='blue')
    plt.show()
    # scatter plot with a least-squares fit line
    plt.scatter(rates_array, starts["Total"], s=25, c='red', alpha=0.5)
    ax3 = plt.gca()
    ax3.set_ylabel('Construction Starts', fontsize=11, color='b')
    ax3.set_xlabel('Unemployment', fontsize=11, color='r')
    m, b = np.polyfit(rates_array, starts["Total"], 1)
    plt.plot(rates_array, np.array(rates_array) * m + b)
    plt.show()
In [106]:
def run():
    unemployment_construction_correlation()

if __name__ == '__main__':
    run()
Start by looking at variations by state.
In [107]:
statedata = pd.read_csv("../data/incidents_state_totals.csv", usecols=range(2), index_col='state')
statedata.plot(kind='bar', stacked=False, figsize=(16, 4))
plt.title("Frequency of fatalities and catastrophic events by state")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()
California is an outlier! Let's set it (and South Dakota) aside and look at the others.
In [109]:
fifty = statedata.drop(statedata.index[[4,41]])
fifty.plot(kind='bar', stacked=False, figsize=(16, 4))
plt.title("Frequency of fatalities and catastrophic events by state (except CA and SD)")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()
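The outlier call above is made by eye; as a rough cross-check, a z-score over the state totals flags the same extremes. A minimal sketch, assuming statedata still holds the totals column loaded above:
In [ ]:
# Rough outlier check: flag states more than two standard deviations
# from the mean total (a crude but common threshold).
totals = statedata.iloc[:, 0]
zscores = (totals - totals.mean()) / totals.std()
print(totals[zscores.abs() > 2])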
Now let's look at how things have changed over time on a state-level.
In [110]:
statehistory = pd.read_csv("../data/incidents_state_totals.csv", index_col='state')
# drop the aggregate columns, keeping only the cumulative five-year buckets
statehistory.drop(['totals', 'totals_to_1990', 'totals_since_2010'], axis=1, inplace=True)
statehistory = statehistory.drop(statedata.index[[4, 41]])  # again without CA and SD
statehistory.plot(kind='bar', stacked=True, figsize=(16, 8))
plt.title("State change in fatal and catastrophic events over time")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()
Run a Pearson correlation test to see whether incidents in one region increase as they decrease in another.
In [179]:
def state_totals():
    change = []
    statetotals = pd.read_csv("../data/incidents_state_totals.csv")
    df = pd.DataFrame(statetotals, columns=['state', 'totals_to_1995', 'totals_to_2000', 'totals_to_2005', 'totals_to_2010'])
    for index, row in df.iterrows():
        # average drop across the two overlapping ten-year windows
        n = ((row['totals_to_1995'] - row['totals_to_2005']) + (row['totals_to_2000'] - row['totals_to_2010'])) / 2
        change.append((n, row['state']))
    rank = sorted(change, key=lambda x: x[0])  # states ranked by change
    return df, rank

df, rank = state_totals()  # keep df around for the regional comparison below
Compare the change over time in the Northeast to the Southeast
In [184]:
se_rows = [0, 10, 40, 46]   # Alabama, Georgia, South Carolina, Virginia
ne_rows = [6, 21, 29, 31]   # Connecticut, Massachusetts, New Hampshire, New York
periods = ['totals_to_1995', 'totals_to_2000', 'totals_to_2005', 'totals_to_2010']
SE = [df.loc[se_rows, p].sum() for p in periods]
NE = [df.loc[ne_rows, p].sum() for p in periods]
print stats.pearsonr(SE, NE)
Seasonality – Occupational fatalities and catastrophes are seasonal, with the most occurring in the summer and the fewest occurring in winter.
Industrial change – Certain industries are more dangerous than others. Based on this dataset, we found that the most dangerous appear to be: roofing, power line construction, electrical work, and carpentry.
Impact of the recession – The recession has had a significant impact on the construction industry. We found that the unemployment rate is inversely correlated with the number of building permits issued for residential construction projects (r = -0.6).
Variation by state or area – Fatalities and catastrophes do vary widely by state, though generally in proportion to the population size. We also found some evidence that decreases in certain regions are correlated with increases in other regions over the last 30 years. Most notably, as the number of workplace fatalities and catastrophes has decreased in the Southeast (defined as the cluster of Alabama, Georgia, South Carolina, and Virginia), incidents seem to have increased in the Northeast (Connecticut, Massachusetts, New Hampshire, and New York), with an r of about -0.72.
Our initial investigation suggests several avenues for future analysis as well as predictive modeling projects.
Seasonality – Moving forward, the seasonality analysis could be done per year by region and/or state. With a more granular analysis, one might be able to identify whether there is any correlation between workplace fatalities and extreme weather events or seasons. Two events that would make good test cases are 2012's Hurricane Sandy and the ice storm that hit Atlanta, GA in February 2014. A seasonality-based predictive model could allow OSHA to forecast fatalities due to extreme weather by state; with such a forecast, the agency could allocate proper budgets and plan preventative workplace training ahead of time. Many industries would benefit from forecasts built on the data assembled for this project.
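A hedged sketch of what that more granular cut might look like, reusing season_map from the earlier cell; the 'state', 'year', and 'month' column names are guesses about oshadata's schema, not confirmed by the source:
In [ ]:
# Hypothetical sketch: seasonal incident counts per state and year.
# Column names 'state', 'year', and 'month' are assumptions.
oshadata["season"] = oshadata["month"].map(season_map)
print(oshadata.groupby(["state", "year", "season"]).size().head(12))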
Industrial change – Moving forward, it would be useful to look at how these hazardous-industry rankings have changed over time. It would also be interesting to look at relationships between these industries – perhaps checking for a cross-correlation between an increase in fatalities in one industry and a decrease in another.
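A minimal sketch of that cross-correlation idea; the 'Year' column is an assumption about osha_data.csv's schema (only the 'SIC' column is confirmed above):
In [ ]:
# Hypothetical sketch: pairwise Pearson correlation of yearly fatality
# counts across industries. 'Year' is an assumed column name.
ind = pd.read_csv("../data/osha_data.csv")
yearly_by_sic = ind.groupby(["Year", "SIC"]).size().unstack(fill_value=0)
# strongly negative off-diagonal entries would hint that fatalities in
# one industry rise as another's fall
print(yearly_by_sic.corr().round(2))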
Impact of the recession – It would be interesting to expand this study to break things down by industry – particularly manufacturing and construction – and to look at the impact of previous recessions on those industries. Given the correlation we did find, we could look at ways for OSHA to pre-empt fatalities, knowing that as unemployment rates decline, workplace fatalities are likely to increase.
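Before breaking things out by industry, the aggregate relationship above can be recast as a regression. A minimal sketch using the statsmodels import (sm) from the first cell; per-industry fatality series, which we don't have here, would slot into the same formula:
In [ ]:
# Sketch: OLS of construction starts on the unemployment rate, reusing
# rates_array and construction_starts() defined above.
reg_df = pd.DataFrame({"unemployment": rates_array,
                       "starts": construction_starts()["Total"]})
fit = sm.ols(formula="starts ~ unemployment", data=reg_df).fit()
print(fit.params)  # the slope's sign should match the negative Pearson r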
Variation by state or area – We have enough data to break down these state incidents on a more granular level, by individual local area offices. It would be interesting to look at in-state variations, particularly over time. Even more interesting would be to incorporate county business pattern and population statistics to identify trends, and create a predictive model that anticipates fatality rates based upon industry and population growth. This kind of project could enable OSHA to better allocate resources across the regions in accordance with need, and even embark on redistricting that would ensure that each area office is grouped around a nexus of predicted hazardous activity.
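As a hedged sketch of that modeling direction, using the sklearn import from the first cell; the features below (population, construction share) and all of the numbers are hypothetical stand-ins for the county business pattern and population data we would need to join in:
In [ ]:
# Hypothetical sketch: predict per-area fatality counts from population
# and industry-mix features. All values are illustrative placeholders.
area_features = np.array([[500000, 0.08],    # [population, construction share]
                          [1200000, 0.05],
                          [300000, 0.12],
                          [900000, 0.07]])
fatality_counts = np.array([12, 18, 9, 15])  # made-up training targets
model = LinearRegression().fit(area_features, fatality_counts)
print(model.predict([[750000, 0.09]]))       # a hypothetical new area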