Team Colonials

Final Project Presentation

Awilda Lopez, Bala Venkatesan, Faith Bradley, Rebecca Bilbro

Cohort 2, Fall 2014

The Problem:

In the first three decades after OSHA was created in 1971, workplace fatalities dropped more than 65%, even as US employment doubled. Now worker deaths are down from about 38 a day in 1970 to 12 a day in 2013. But we've hit a plateau. OSHA's budget is small and not likely to increase. The agency can only do so much outreach, training, enforcement, and standard-setting. How can the agency make the most of its limited resources?

The Goal:

Use OSHA data to build a tool that will help the agency develop targeting schemas for outreach, training, enforcement, and regulation.

The Hypotheses:

1. Time of year - Are many of the most hazardous jobs seasonal?
2. Industries are changing - As hazards decrease in some traditional industries (e.g. traditional manufacturing), are more workers being killed in new industries (e.g. green energy)?
3. Impact of the recession - Are there fewer fatalities in construction because of the housing market collapse? Has the recession depressed the number of worker fatalities, artificially diminishing the impact of hazardous workplaces?
4. Variation by state or area - Are certain parts of the country becoming more or less dangerous over time?

The Data:

The primary data sources are the OSHA website (http://www.osha.gov) and the Department of Labor Enforcement Data Catalog (http://ogesdw.dol.gov/).


In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression
from scipy import stats
%matplotlib inline

What does the dataset look like?


In [2]:
oshadata = pd.read_csv("../data/incidents_by_state.csv", low_memory=False)
oshadata[:5]


Out[2]:
event_date state area_office fatality industry_code description keywords summary_nr
0 1972-09-21 UTAH 854910 NaN 7359 Employee suffers partial thumb amputation in t... SLIP,TABLE SAW,AMPUTATED,THUMB,GUARD,BLADE,SAW 170683817
1 1973-03-08 OHIO 524700 True 3444 Employee fatally crushed by falling automobile CRUSHED,STRUCK BY,FALLING OBJECT,AUTOMOBILE,BO... 14514202
2 1973-06-06 WEST VIRGINIA 316400 True 3334 Employee killed while dismantling crane gearbox DISMANTLING,CRANE,STRUCK BY,FALL,CRUSHED,UNTRA... 14481709
3 1974-07-16 TEXAS 627400 True 1381 One employee killed, two injured when struck b... OIL WELL DRILLING,TONGS,STRUCK BY,HAND TOOL,FL... 14413066
4 1974-07-16 TEXAS 627400 True 1381 Employee killed in trailer house fire FIRE,BURN,SMOKING,FLAMMABLE VAPORS,OIL WELL DR... 14413074

Yeah, but what does it LOOK like? Let's check out a visualization of these incidents as a time series.


In [3]:
df = pd.DataFrame.from_csv("../data/incident_totals_year.csv", parse_dates=False)
df.osha_incidents.plot(color='r',lw=1.3)
plt.title('Workplace Fatalities and Catastrophes by Year')
plt.xlabel("1983-2012")
plt.ylabel("Fatal/Catastrophic Incidents")
plt.show()


It turns out that, because of irregularities in the way these reports were logged before 1990, we should ignore what appears to be a trend in volume from 1983 to 1990; we know from the Bureau of Labor Statistics that there were actually more like 10,000 fatalities per year during that period. The graph also suggests that fatal events trail off after 2009, but it turns out that this is due to data maturation issues: BLS reports show that the number of fatal workplace incidents has held fairly steady at around 4,500 per year since 2010. So let's plot the frequency of events by year as a bar chart, focusing on 1990-2010.


In [4]:
subset = df.loc[1990:2010]
subset.plot(kind='bar', stacked=False)
plt.title("Frequency of fatalities and catastrophic events by year")
plt.xlabel("1990-2010")
plt.ylabel("Frequency")
plt.show()
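
The same 1990-2010 window can also be applied to the incident-level data loaded above. This is a minimal sketch, assuming event_date parses cleanly; the per-hypothesis analyses below load their own extracts, so this step is optional.

incident_years = pd.to_datetime(oshadata['event_date']).map(lambda d: d.year)
oshadata_window = oshadata[(incident_years >= 1990) & (incident_years <= 2010)]
print len(oshadata_window), "incidents recorded between 1990 and 2010"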


The Analysis:

Hypothesis One (Awilda): Are fatalities and catastrophes mostly seasonal?


In [5]:
import pylab as pl
import numpy as np

datafile = "../data/osha_month.csv"
csvfile = open(datafile)

Let's start by grouping the data into seasons.


In [6]:
# Read the month code for each incident row (the second column is read into
# fatality_list but not used below).
month_list = []
fatality_list = []
for each_line in csvfile:
    list_row = each_line.split(',')
    if list_row[0] == 'Month':      # skip the header row
        continue
    month_list.append(list_row[0])
    fatality_list.append(list_row[1])
csvfile.close()

# Count incidents per month code and roll the months up into seasons.
month_dict = {}
season_dict = {}
for month in month_list:
    month_dict[month] = month_dict.get(month, 0) + 1
    if month in ("12", "1", "2"):
        season = "Winter"
    elif month in ("3", "4", "5"):
        season = "Spring"
    elif month in ("6", "7", "8"):
        season = "Summer"
    else:
        season = "Fall"
    season_dict[season] = season_dict.get(season, 0) + 1

# Convert the raw counts to percentages of all incidents.
month_pct_dict = {}
season_pct_dict = {}
number_of_month_codes = len(month_list)
for month in month_dict:
    month_pct_dict[month] = float(month_dict[month]) / number_of_month_codes * 100
for season in season_dict:
    season_pct_dict[season] = float(season_dict[season]) / number_of_month_codes * 100

In [7]:
print 'Month Details'
print '-------------'
for month in sorted(month_pct_dict, key=month_pct_dict.get, reverse=True):
    print month, month_dict[month], month_pct_dict[month]


Month Details
-------------
8 10368 9.89416828102
7 10250 9.78156104171
10 9866 9.41511036464
6 9552 9.11546059224
9 9301 8.87593163405
5 8638 8.24323163691
3 8317 7.93690177404
4 8317 7.93690177404
11 8064 7.69546421857
1 7572 7.22594928857
12 7315 6.98069453855
2 7229 6.89862485566

In [8]:
print 'Season Details'
print '--------------'
chart_dict = {}
for season in sorted(season_pct_dict, key=season_pct_dict.get, reverse=True):
    chart_dict[season] = season_pct_dict[season]
    print season, season_dict[season], season_pct_dict[season]


Season Details
--------------
Summer 30170 28.791189915
Fall 27231 25.9865062173
Spring 25272 24.117035185
Winter 22116 21.1052686828

In [9]:
X = np.arange( len( chart_dict ) )
pl.bar(X, chart_dict.values(), align='center', width=0.5)
pl.xticks(X, chart_dict.keys())
ymax = max( chart_dict.values() ) + 1
pl.ylim(0, ymax)
pl.show()
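
For comparison, the same month-to-season tabulation could be done directly in pandas. A sketch only, assuming osha_month.csv has a 'Month' column of numeric month codes, as the loop above implies.

months = pd.read_csv("../data/osha_month.csv")['Month'].astype(int)
season_map = {12: 'Winter', 1: 'Winter', 2: 'Winter',
              3: 'Spring', 4: 'Spring', 5: 'Spring',
              6: 'Summer', 7: 'Summer', 8: 'Summer',
              9: 'Fall', 10: 'Fall', 11: 'Fall'}
season_counts = months.map(season_map).value_counts()
print season_counts / float(len(months)) * 100    # percent of all incidents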


Hypothesis Two (Faith): As industries change (e.g. from traditional manufacturing to green energy), a plateau in fatalities could mask significant industry-specific decreases and increases over time.


In [10]:
import pylab as pl
import numpy as np
datafile = "../data/osha_data.csv"
csvfile = open(datafile)

Break out industries by Standard Industry Code (SIC).


In [11]:
# Read the SIC code for each incident row (the second column is read into
# fatality_list but not used below).
sic_list = []
fatality_list = []
for each_line in csvfile:
    list_row = each_line.split(',')
    if list_row[0] == 'SIC':        # skip the header row
        continue
    sic_list.append(list_row[0])
    fatality_list.append(list_row[1])

csvfile.close()

# Count incidents per SIC code and convert to percentages of all incidents.
sic_dict = {}
for sic in sic_list:
    sic_dict[sic] = sic_dict.get(sic, 0) + 1

sic_pct_dict = {}
number_of_sic_codes = len(sic_list)
for sic in sic_dict:
    sic_pct_dict[sic] = float(sic_dict[sic]) / number_of_sic_codes * 100

Print out the top 25 SIC codes.


In [12]:
loop = 0
chart_dict = {}
for sic in sorted(sic_pct_dict, key=sic_pct_dict.get, reverse=True):
    chart_dict[sic] = sic_pct_dict[sic]
    print sic, sic_dict[sic], sic_pct_dict[sic] 
    loop = loop + 1
    if loop >= 25:
        break


1761 3528 3.36676559563
1623 2913 2.77987193312
1731 2737 2.6119153728
1799 2411 2.30081401674
1751 2341 2.23401311206
1611 2284 2.17961808968
2411 2157 2.05842216263
1791 2144 2.04601628033
1542 2002 1.91050587371
1711 1907 1.81984750308
1521 1715 1.63662216454
1629 1494 1.42572216549
783 1348 1.28639456431
1721 1337 1.27589727929
1771 1243 1.1861932073
4911 1214 1.15851854679
4212 1212 1.15660994952
1741 1193 1.13847827539
1794 1182 1.12798099037
1541 1127 1.07549456527
2421 1092 1.04209411293
1742 1027 0.980064701448
1622 973 0.928532574984
1389 914 0.872228955329
5093 871 0.831194113886

In [13]:
X = np.arange( len( chart_dict ) )
pl.bar(X, chart_dict.values(), align='center', width=0.5)
pl.xticks(X, chart_dict.keys())
ymax = max( chart_dict.values() ) + 1
pl.ylim(0, ymax)
pl.show()
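
The same top-25 tabulation can also be done with pandas, with a few of the leading codes labeled for readability. This is a sketch: it assumes the SIC column reads in as integer codes, and the labels below simply echo the industry names given in the interpretation section, so they should be checked against the official SIC manual before reuse.

sic = pd.read_csv("../data/osha_data.csv")['SIC'].astype(str)
top25_pct = sic.value_counts().head(25) / float(len(sic)) * 100
labels = {'1761': 'roofing & sheet metal work', '1623': 'power line construction',
          '1731': 'electrical work', '1751': 'carpentry'}
top25_pct.index = [labels.get(code, code) for code in top25_pct.index]
print top25_pct.head(10)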


Hypothesis Three (Bala): The recession and housing market collapse have artificially reduced fatality rates, particularly in certain industries.


In [101]:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np

In [102]:
year=['1984','1985','1986','1987','1988','1989','1990','1991','1992','1993','1994','1995',
      '1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007',
      '2008','2009','2010','2011','2012','2013','2014']

rates_array = []

In [103]:
def unemployment_data():
    # Average each column of unemployment_data.csv (assumed to hold one column
    # per year) to get an annual rate, and populate rates_array for reuse.
    f = pd.read_csv("../data/unemployment_data.csv")
    df_rate = pd.DataFrame(data=f.mean(), columns=['RATE'])
    for val in df_rate['RATE'].values:
        rates_array.append(float(val))
    return df_rate

In [104]:
def construction_starts():
    construction_file = pd.read_csv("../data/construction_starts.csv")
    df = pd.DataFrame(construction_file)
    return df

In [105]:
def unemployment_construction_correlation():

    rates = unemployment_data()
    starts = construction_starts()
    # calculating Pearson Correlation Coefficient
    print stats.pearsonr(starts["Total"], rates_array)

    # plotting the graph.

    ax = plt.gca()
    ax.set_xlabel('Year', fontsize=14, color='b')
    ax.set_ylabel('Unemployment', fontsize=11, color='b')
    ax.plot(year,rates,color='red')


    ax2 = ax.twinx()
    ax2.set_ylabel('Construction Starts', fontsize=11, color='b')
    ax2.plot(year,starts["Total"],color='blue')
    plt.show()


    plt.scatter(rates_array, starts["Total"], s=25, c='red', alpha=0.5)
    ax3 = plt.gca()
    ax3.set_ylabel('Construction Starts', fontsize=11, color='b')
    ax3.set_xlabel('Unemployment', fontsize=11, color='r')

    ##includes the correlation fit line
    m,b = np.polyfit(rates_array, starts["Total"],1)
    plt.plot(rates_array, starts["Total"], rates_array, np.array(rates_array) * m + b)
    plt.show()

In [106]:
def run():
    unemployment_construction_correlation()
    
if __name__ == '__main__':
    run()


(-0.60333102368122793, 0.00032705308523674057)
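
Since the hypothesis is ultimately about fatalities rather than construction starts, a possible extension is to correlate the unemployment rate directly with the annual incident counts loaded in In [3]. The sketch below is illustrative only: it assumes both series can be keyed by integer year, reuses the year list and rates_array built above, and restricts to the more reliable 1990-2010 window discussed earlier.

rates = pd.Series(rates_array, index=[int(y) for y in year])
incidents = df['osha_incidents'].copy()      # df from the In [3] cell above
incidents.index = [int(y) for y in incidents.index]
window = [y for y in range(1990, 2011) if y in incidents.index and y in rates.index]
print stats.pearsonr(rates.loc[window], incidents.loc[window])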

Hypothesis Four (Rebecca): The fatality/catastrophe plateau could be masking significant increases in certain parts of the country as incidents drop off in other places.

Start by looking at variations by state


In [107]:
statedata = pd.read_csv("../data/incidents_state_totals.csv", usecols=range(2), index_col='state')
statedata.plot(kind='bar', stacked=False, figsize=(16, 4))
plt.title("Frequency of fatalities and catastrophic events by state")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()


California is an outlier! Let's drop it (along with South Dakota) and look at the others.


In [109]:
fifty = statedata.drop(statedata.index[[4,41]])
fifty.plot(kind='bar', stacked=False, figsize=(16, 4))
plt.title("Frequency of fatalities and catastrophic events by state (except CA and SD)")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()


Now let's look at how things have changed over time on a state-level.


In [110]:
statehistory = pd.read_csv("../data/incidents_state_totals.csv", index_col='state')
statehistory.drop('totals', axis=1, inplace=True)
statehistory.drop('totals_to_1990', axis=1, inplace=True)
statehistory.drop('totals_since_2010', axis=1, inplace=True)
statehistory = statehistory.drop(statedata.index[[4,41]])
statehistory.plot(kind='bar', stacked=True, figsize=(16, 8))
plt.title("State change in fatal and catastrophic events over time")
plt.xlabel("United States")
plt.ylabel("Frequency")
plt.show()


Run a Pearson correlation test to see whether incidents in one area are increasing as they decrease in another.


In [179]:
def state_totals():
    change = []
    statetotals = pd.read_csv("../data/incidents_state_totals.csv")
    df = pd.DataFrame(statetotals, columns=['state', 'totals_to_1995', 'totals_to_2000', 'totals_to_2005', 'totals_to_2010'])
    for index, row in df.iterrows():
        # decline from earlier to later period totals, averaged over two comparisons
        n = ((row['totals_to_1995'] - row['totals_to_2005']) + (row['totals_to_2000'] - row['totals_to_2010'])) / 2
        change.append((n, row['state']))
    rank = sorted(change, key=lambda x: x[0])
    return rank, df

rank, df = state_totals()

Compare the change over time in the Northeast to the Southeast


In [184]:
#Alabama, Georgia, South Carolina, Virginia
SE = [(df.loc[0, 'totals_to_1995'] + df.loc[10, 'totals_to_1995'] + df.loc[40, 'totals_to_1995']+ df.loc[46, 'totals_to_1995']),
      (df.loc[0, 'totals_to_2000'] + df.loc[10, 'totals_to_2000'] + df.loc[40, 'totals_to_2000']+ df.loc[46, 'totals_to_2000']),
      (df.loc[0, 'totals_to_2005'] + df.loc[10, 'totals_to_2005'] + df.loc[40, 'totals_to_2005']+ df.loc[46, 'totals_to_2005']),
      (df.loc[0, 'totals_to_2010'] + df.loc[10, 'totals_to_2010'] + df.loc[40, 'totals_to_2010']+ df.loc[46, 'totals_to_2010'])]

#Connecticut, Massachusetts, New Hampshire, New York
NE = [(df.loc[6, 'totals_to_1995'] + df.loc[21, 'totals_to_1995'] + df.loc[29, 'totals_to_1995'] + df.loc[31, 'totals_to_1995']),
      (df.loc[6, 'totals_to_2000'] + df.loc[21, 'totals_to_2000'] + df.loc[29, 'totals_to_2000'] + df.loc[31, 'totals_to_2000']),
      (df.loc[6, 'totals_to_2005'] + df.loc[21, 'totals_to_2005'] + df.loc[29, 'totals_to_2005'] + df.loc[31, 'totals_to_2005']),
      (df.loc[6, 'totals_to_2010'] + df.loc[21, 'totals_to_2010'] + df.loc[29, 'totals_to_2010'] + df.loc[31, 'totals_to_2010'])]

print stats.pearsonr(SE, NE)


(-0.72345686493053973, 0.27654313506946027)
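
To push this further, one could screen all state pairs for opposite movement rather than hand-picking two regions. The following is a rough, self-contained sketch; with only four period totals per state these correlations are very noisy, so treat the output as a screening step rather than evidence.

totals = pd.read_csv("../data/incidents_state_totals.csv")
periods = ['totals_to_1995', 'totals_to_2000', 'totals_to_2005', 'totals_to_2010']
by_state = totals.set_index('state')[periods].T     # rows = periods, columns = states
state_corr = by_state.corr()                        # state-by-state Pearson correlations
# for each state, the state whose totals move most strongly in the opposite direction
most_negative = state_corr.apply(lambda col: col.drop(col.name).idxmin())
print most_negative.head()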

The Interpretations:

  1. Seasonality – Occupational fatalities and catastrophes are seasonal: the most occur in summer and the fewest in winter.

  2. Industrial change – Certain industries are more dangerous than others. Based on this dataset, the most dangerous appear to be roofing, power line construction, electrical work, and carpentry.

  3. Impact of the recession – The recession has had a significant impact on the construction industry. We found that the unemployment rate is inversely correlated with the number of building permits issued for residential construction projects (r = -0.6).

  4. Variation by state or area – Fatalities and catastrophes do vary widely by state, though generally in proportion to the population size. We also found some evidence that decreases in certain regions are correlated with increases in other regions over the last 30 years. Most notably, as the number of workplace fatalities and catastrophes has decreased in the Southeast (defined as the cluster of Alabama, Georgia, South Carolina, and Virginia), incidents seem to have increased in the Northeast (Connecticut, Massachusetts, New Hampshire, and New York), with an r of about -0.72.

Conclusions and Next Steps: Modeling

Our initial investigative analysis suggests several avenues for future analysis as well as predictive modeling projects.

  1. Seasonality – Moving forward, the seasonality analysis could be done per year, by region and/or state (see the sketch after this list). With a more granular analysis, one might be able to identify whether workplace fatalities correlate with extreme weather events or seasons; good test cases would be Hurricane Sandy in 2012 or the February 2014 ice storm in Atlanta, GA. A seasonality model could allow OSHA to forecast weather-related fatalities by state, so the agency could allocate budgets appropriately and plan preventative workplace training ahead of time. Many industries would also benefit from forecasts built on the data used in this project.

  2. Industrial change – Moving forward, it would be useful to look at how these hazardous-industry rankings have changed over time. It would also be interesting to look at relationships between these industries – perhaps a cross-correlation between an increase in fatalities in one industry and a decrease in another.

  3. Impact of the recession – It would be interesting to expand this study by breaking things down by industry – particularly manufacturing and construction – and to look at the impact of previous recessions on those industries. Given the correlation we did find, we could look at ways for OSHA to preempt fatalities, knowing that as unemployment rates decline, workplace fatalities are likely to increase.

  4. Variation by state or area – We have enough data to break down these state incidents at a more granular level, by individual local area office. It would be interesting to look at in-state variation, particularly over time. Even more interesting would be to incorporate County Business Patterns and population statistics to identify trends, and to build a predictive model that anticipates fatality rates based on industry and population growth. Such a project could help OSHA allocate resources across regions according to need, and even support redistricting so that each area office is grouped around a nexus of predicted hazardous activity.
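
As a starting point for the per-year, per-state seasonality breakdown described in item 1, here is a minimal sketch. It reuses the incident-level frame (oshadata) loaded at the top and the season mapping from Hypothesis One, and assumes event_date parses cleanly.

dates = pd.to_datetime(oshadata['event_date'])
months = dates.map(lambda d: d.month)
years = dates.map(lambda d: d.year)

def to_season(m):
    if m in (12, 1, 2):
        return 'Winter'
    if m in (3, 4, 5):
        return 'Spring'
    if m in (6, 7, 8):
        return 'Summer'
    return 'Fall'

# year x state rows, season columns -- the unit of analysis proposed in item 1
seasonal = pd.crosstab([years, oshadata['state']], months.map(to_season))
print seasonal.head()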

Questions?