Analysis of U.S. Incomes by Occupation and Gender

Notebook by Harish Kesava Rao

Use of this dataset should cite the Bureau of Labor Statistics as per their copyright information: The Bureau of Labor Statistics (BLS) is a Federal government agency and everything that we publish, both in hard copy and electronically, is in the public domain, except for previously copyrighted photographs and illustrations. You are free to use our public domain material without specific permission, although we do ask that you cite the Bureau of Labor Statistics as the source.

What kind of questions are we aiming to answer through the data analysis?

  1. Is there an inequality of income in the labor force?
  2. Is it more pronounced in certain sectors than others?
  3. Can the dataset help us answer the questions?
    a. Is the dataset in a useable format?
    b. How much of pre-analysis preparation is required before we start analyzing the dataset?
  4. What other inferences can we derive out of the dataset?

Analyzing the dataset

Median weekly earnings of full-time wage and salary workers by detailed occupation and sex.
Occupation: Job title as given from BLS. Industry summaries are given in ALL CAPS.
All_workers: Number of workers male and female, in thousands.
All_weekly: Median weekly income including male and female workers, in USD.
M_workers: Number of male workers, in thousands.
M_weekly: Median weekly income for male workers, in USD.
F_workers: Number of female workers, in thousands.
F_weekly: Median weekly income for female workers, in USD.


In [194]:
import pandas as pd
from pandas import DataFrame, Series

We assign more meaningful and self explanatory column headers. The names list can be assigned to the names parameter when reading in the csv file.


In [195]:
names = ['Occupations','All_workers', 'Weekly_Income_Overall', 
                             'Male_Workers (in 1000s)', 'Median_weekly_income (Male)',
                             'Female_Workers (in 1000s)', 'Median_weekly_income (Female)']

In [196]:
#mac
income = pd.read_csv('/Users/Harish/Documents/HK_Work/Python/Python-for-Data-Analysis/Kaggle-US-Incomes/inc_occ_gender.csv'
                   ,names=names, header=None)

#win
#income = pd.read_csv(r'C:\Users\\Documents\Personal\Python-for-Data-Analysis\Kaggle-US-Incomes\inc_occ_gender.csv'
                # ,names=names, header=None)

To manipulate and perform operations on numerical columns, we will need to handle null/empty/non-numerical values in the numerical columns. In our dataset, this means everything except the Occupations column.


In [197]:
income.head()


Out[197]:
Occupations All_workers Weekly_Income_Overall Male_Workers (in 1000s) Median_weekly_income (Male) Female_Workers (in 1000s) Median_weekly_income (Female)
0 Occupation All_workers All_weekly M_workers M_weekly F_workers F_weekly
1 ALL OCCUPATIONS 109080 809 60746 895 48334 726
2 MANAGEMENT 12480 1351 7332 1486 5147 1139
3 Chief executives 1046 2041 763 2251 283 1836
4 General and operations managers 823 1260 621 1347 202 1002

The original header that was part of the csv can be dropped


In [198]:
income.drop(0, inplace=True)

In [229]:
income['income_index'] = income.index

The following looks more readable


In [232]:
income.head()


Out[232]:
Occupations All_workers Weekly_Income_Overall Male_Workers (in 1000s) Median_weekly_income (Male) Female_Workers (in 1000s) Median_weekly_income (Female) income_index
1 ALL OCCUPATIONS 109080 809 60746 895 48334 726 1
2 MANAGEMENT 12480 1351 7332 1486 5147 1139 2
3 Chief executives 1046 2041 763 2251 283 1836 3
4 General and operations managers 823 1260 621 1347 202 1002 4
5 Legislators 8 0 5 0 4 0 5

Now, we will slice the dataset to understand more about the data


In [200]:
#selecting just the occupations and worker count into a new dataframe
job_type_by_worker_count = income[['Occupations','All_workers']]

In [201]:
job_type_by_worker_count.sort_values(by='All_workers', ascending =False)


Out[201]:
Occupations All_workers
204 Miscellaneous health technologists and technic... 99
186 Respiratory therapists 99
279 SALES 9725
446 Miscellaneous assemblers and fabricators 950
40 Cost estimators 95
326 Reservation and transportation ticket agents a... 95
7 Marketing and sales managers 948
246 Food servers, nonrestaurant 93
109 Chemists and materials scientists 93
346 Office clerks, general 929
284 Parts salespersons 92
130 Miscellaneous community and social service spe... 92
141 Postsecondary teachers 917
216 Phlebotomists 91
390 Highway maintenance workers 91
110 Environmental scientists and geoscientists 90
70 Database administrators 90
180 Podiatrists 9
265 Ushers, lobby attendants, and ticket takers 9
437 Signal and track switch repairers 9
391 Rail-track laying and maintenance equipment op... 9
513 Adhesive bonding machine operators and tenders 9
91 Marine engineers and naval architects 9
449 Food and tobacco roasting, baking, and drying ... 9
31 Emergency management directors 9
484 Textile cutting machine setters, operators, an... 9
228 Parking enforcement workers 9
380 Reinforcing iron and rebar workers 9
366 Carpet, floor, and tile installers and finishers 89
255 Maids and housekeeping cleaners 876
... ... ...
1 ALL OCCUPATIONS 109080
164 Editors 108
187 Speech-language pathologists 108
43 Training and development specialists 107
247 Dining room and cafeteria attendants and barte... 107
56 Insurance underwriters 106
382 Sheet metal workers 106
319 Interviewers, except eligibility and loan 105
144 Secondary school teachers 1048
3 Chief executives 1046
337 Stock clerks and order fillers 1027
327 Information and record clerks, all other 100
262 Nonfarm animal caretakers 100
369 Paving, surfacing, and tamping equipment opera... 10
536 Sailors and marine oilers 10
512 Semiconductor processors 1
343 Desktop publishers 1
433 Commercial divers 1
106 Life scientists, all other 1
461 Milling and planing machine setters, operators... 1
466 Multiple machine tool setters, operators, and ... 1
78 Miscellaneous mathematical science occupations 1
481 Shoe machine operators and tenders 1
493 Model makers and patternmakers, wood 0
377 Paperhangers 0
358 Hunters and trappers 0
115 Sociologists 0
113 Survey researchers 0
556 Mine shuttle car operators 0
487 Extruding and forming machine setters, operato... 0

558 rows × 2 columns

The sorting did not happen by the All_workers column. Let's see why that is the case.


In [202]:
income.dtypes


Out[202]:
Occupations                      object
All_workers                      object
Weekly_Income_Overall            object
Male_Workers (in 1000s)          object
Median_weekly_income (Male)      object
Female_Workers (in 1000s)        object
Median_weekly_income (Female)    object
dtype: object

The columns are objects. We will need to convert them to primitive types, such as integer, string etc.


In [203]:
job_type_by_worker_count2 = income.apply(pd.to_numeric, errors = 'ignore')

In [204]:
job_type_by_worker_count2.dtypes


Out[204]:
Occupations                      object
All_workers                       int64
Weekly_Income_Overall            object
Male_Workers (in 1000s)           int64
Median_weekly_income (Male)      object
Female_Workers (in 1000s)         int64
Median_weekly_income (Female)    object
dtype: object

The conversion happened for All_workers, Male_Workers (in 1000s), Female_Workers (in 1000s), but not for the other columns. We will see why.


In [205]:
job_type_by_worker_count2.head(n=15)


Out[205]:
Occupations All_workers Weekly_Income_Overall Male_Workers (in 1000s) Median_weekly_income (Male) Female_Workers (in 1000s) Median_weekly_income (Female)
1 ALL OCCUPATIONS 109080 809 60746 895 48334 726
2 MANAGEMENT 12480 1351 7332 1486 5147 1139
3 Chief executives 1046 2041 763 2251 283 1836
4 General and operations managers 823 1260 621 1347 202 1002
5 Legislators 8 Na 5 Na 4 Na
6 Advertising and promotions managers 55 1050 29 Na 26 Na
7 Marketing and sales managers 948 1462 570 1603 378 1258
8 Public relations and fundraising managers 59 1557 24 Na 35 Na
9 Administrative services managers 170 1191 96 1451 73 981
10 Computer and information systems managers 636 1728 466 1817 169 1563
11 Financial managers 1124 1408 551 1732 573 1130
12 Compensation and benefits managers 23 Na 7 Na 16 Na
13 Human resources managers 254 1365 68 1495 186 1274
14 Training and development managers 37 Na 17 Na 20 Na
15 Industrial production managers 267 1485 221 1528 45 Na

The columns for which the type conversion did not happend contain 'Na' values in them. These look like NULL or NA values, but actually, they are strings with the value 'Na'. We can work around these values by replacing them with 0.


In [206]:
income.replace('Na',0,inplace=True)

In [207]:
income.head(n=15)


Out[207]:
Occupations All_workers Weekly_Income_Overall Male_Workers (in 1000s) Median_weekly_income (Male) Female_Workers (in 1000s) Median_weekly_income (Female)
1 ALL OCCUPATIONS 109080 809 60746 895 48334 726
2 MANAGEMENT 12480 1351 7332 1486 5147 1139
3 Chief executives 1046 2041 763 2251 283 1836
4 General and operations managers 823 1260 621 1347 202 1002
5 Legislators 8 0 5 0 4 0
6 Advertising and promotions managers 55 1050 29 0 26 0
7 Marketing and sales managers 948 1462 570 1603 378 1258
8 Public relations and fundraising managers 59 1557 24 0 35 0
9 Administrative services managers 170 1191 96 1451 73 981
10 Computer and information systems managers 636 1728 466 1817 169 1563
11 Financial managers 1124 1408 551 1732 573 1130
12 Compensation and benefits managers 23 0 7 0 16 0
13 Human resources managers 254 1365 68 1495 186 1274
14 Training and development managers 37 0 17 0 20 0
15 Industrial production managers 267 1485 221 1528 45 0

This looks better, without the Na string values. Now, we will convert all the columns, except Occupations to int. Occupations is a string column and we can leave it as an object type.


In [208]:
income_clean = income.apply(pd.to_numeric, errors = 'ignore')

In [209]:
income_clean.dtypes


Out[209]:
Occupations                      object
All_workers                       int64
Weekly_Income_Overall             int64
Male_Workers (in 1000s)           int64
Median_weekly_income (Male)       int64
Female_Workers (in 1000s)         int64
Median_weekly_income (Female)     int64
dtype: object

We will now slice the data and analyze the count of workers in each occupation.


In [210]:
top10_worker_count = income_clean.sort_values(by='All_workers', axis=0, 
                                                           ascending=False)[['Occupations','All_workers',
                                                                                    'Male_Workers (in 1000s)',
                                                                                    'Female_Workers (in 1000s)']]

In [211]:
top10_worker_count.drop(1, axis=0, inplace=True)

The first row shows the total number of workers. We will drop the row since it is not useful for our analysis and will skew the overall pattern.


In [212]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [228]:
top_worker_count_barplot = sb.factorplot(x='Occupations',
           y='All_workers', data=top10_worker_count.head(n=5), hue='All_workers', 
                                     size=7, aspect=2.3,kind='bar')
top_worker_count_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female", 
                             title = "Top worker counts and the occupations")
sb.set(font_scale=2.3)


We have a plot showing top employee count for 5 occupations. We will append gender wise numbers on the same plot. For this purpose, we use pandas' melt function. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html


In [ ]:
top10_worker_count.head()

In [ ]:
top10_worker_count = pd.melt(top10_worker_count, id_vars=["Occupations"], 
                 value_vars=['All_workers','Male_Workers (in 1000s)','Female_Workers (in 1000s)'])

In [ ]:
top10_mf_count = income_clean.sort_values(by='All_workers', axis=0, 
                                                           ascending=False)[['Occupations',
                                                                                    'Male_Workers (in 1000s)',
                                                                                    'Female_Workers (in 1000s)']]

In [ ]:
top10_mf_count.drop(1, axis=0, inplace=True)

In [ ]:
top10_male_female = pd.melt(top10_mf_count, id_vars=["Occupations"], 
                 value_vars=['Male_Workers (in 1000s)','Female_Workers (in 1000s)'])

In [ ]:
top10_worker_count.replace('NaN',0,inplace=True)
top10_male_female.replace('NaN',0,inplace=True)

In [ ]:
sorted_worker_count_n = top10_worker_count.sort_values(by='value', axis=0, 
                                                           ascending=False)[['Occupations','variable','value']]

In [ ]:
sorted_male_female_count_n = top10_male_female.sort_values(by='value', axis=0, 
                                                           ascending=False)[['Occupations','variable','value']]

In [ ]:
sorted_worker_count_n.head(n=5)

In [214]:
sorted_male_female_count_n.head(n=5)


Out[214]:
Occupations variable value
557 OFFICE Female_Workers (in 1000s) 9933
1 MANAGEMENT Male_Workers (in 1000s) 7332
4 TRANSPORTATION Male_Workers (in 1000s) 5998
8 CONSTRUCTION Male_Workers (in 1000s) 5586
3 PRODUCTION Male_Workers (in 1000s) 5548

In [222]:
sorted_worker_count_n_barplot = sb.factorplot(x='Occupations',
           y='value', data=sorted_worker_count_n.head(n=10), hue='variable', 
                                         palette="bright",
                                     size=10, aspect=5,kind='bar')
sorted_worker_count_n_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female", 
                             title = "Top worker counts and the occupations")
sb.set(font_scale=4.0)


In the above plot, we see that there are some professions involving only male employees - such as pales, production, transportation, construction etc. Similarly, there are occuapations that involve only female employees - such as healthcare professionals, education etc. This is just a sample of 20 occupations. We can analyze more, as seen below.


In [ ]:
sorted_male_female_count_n_barplot = sb.factorplot(x='Occupations',
           y='value', data=sorted_male_female_count_n.head(n=10), hue='variable', 
                                         palette="bright",
                                     size=12, aspect=5,kind='bar')
sorted_male_female_count_n_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female", 
                             title = "Top male, female employee counts and the occupations")
sb.set(font_scale=4)

We can observe a similar trend in the above plot as well, where transportation, construction, production, maintenance and groundskeeping involve male employees and education and healthcare involves female employees. We can also see that business, office, sale etc. have both female and male employees.

Now, let us sort the Occupations by weekly median income, irrespective of gender of employees


In [ ]:
top10_income_count = income_clean.sort_values(by='Weekly_Income_Overall', axis=0, 
                                                           ascending=False)[['Occupations',
                                                                                    'Weekly_Income_Overall']]

low10_income_count = income_clean.sort_values(by='Weekly_Income_Overall', axis=0, 
                                                           ascending=True)[['Occupations',
                                                                                    'Weekly_Income_Overall']]

In [227]:
top10_income_count_barplot = sb.factorplot(x='Occupations',
           y='Weekly_Income_Overall', data=top10_income_count.head(n=5), hue='Occupations', 
                                         palette="bright",
                                     size=12, aspect=5, kind='bar')
top10_income_count_barplot.set(xlabel = "Occupations", ylabel = "Weekly Income", 
                             title = "Top Weekly Incomes and the respective occupations")
sb.set(font_scale=2)
#size=12, aspect=5,



In [160]:
low10_income_count = low10_income_count[(low10_income_count != 0).all(1)]

In [162]:
# lowest10_income_count_barplot = sb.factorplot(x='Weekly_Income_Overall',
#             y='Occupations', data=low10_income_count.head(n=10), hue='Occupations',palette="bright",
#                                              size=12, aspect=5, kind='bar')
# lowest10_income_count_barplot.set(xlabel='Occupations', ylabel = 'Weekly Income',
#                                  title = 'Lowest Weekly Incomes and the respective occupations')