In [24]:

    
# Author: Matthew Green
# Date: 12/7/2017
# License: MIT

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Introduction

I downloaded a CSV of City of Chicago employee salary data, which includes the names, titles, departments and salaries of Chicago employees. I was interested to see whether men and women earn similar salaries for similar roles.

City data don't report the gender of employees, so I used an employee's first name as a proxy, which is explained in more detail below.



In [2]:

    
workers = pd.read_csv('Current_Employee_Names__Salaries__and_Position_Titles.csv')

To simplify the analysis, I restricted my attention to full-time employees with a salary.



In [3]:

    
workers = workers[(workers['Salary or Hourly']=='Salary') & (workers['Full or Part-Time']=='F')]

To make grouping and matching first names easier, I extracted the first name from each employee record and lower-cased it:



In [4]:

    
workers['Name'] = workers['Name'].apply(lambda s: s.split(',')[1].split()[0].lower())

Annual salary is represented as a string with a leading $ sign, which I converted to floats.



In [5]:

    
workers['Annual Salary'] = workers['Annual Salary'].apply(lambda s: float(s.strip('$')))

The first ten rows of the data set look like this now:



In [6]:

    
workers.head(10)









    Out[6]:







  
    
      
      Name
      Job Titles
      Department
      Full or Part-Time
      Salary or Hourly
      Typical Hours
      Annual Salary
      Hourly Rate
    
  
  
    
      0
      jeffery
      SERGEANT
      POLICE
      F
      Salary
      NaN
      101442.0
      NaN
    
    
      1
      karina
      POLICE OFFICER (ASSIGNED AS DETECTIVE)
      POLICE
      F
      Salary
      NaN
      94122.0
      NaN
    
    
      2
      kimberlei
      CHIEF CONTRACT EXPEDITER
      GENERAL SERVICES
      F
      Salary
      NaN
      101592.0
      NaN
    
    
      3
      vicente
      CIVIL ENGINEER IV
      WATER MGMNT
      F
      Salary
      NaN
      110064.0
      NaN
    
    
      6
      christopher
      STAFF ASST TO THE ALDERMAN
      CITY COUNCIL
      F
      Salary
      NaN
      50436.0
      NaN
    
    
      9
      james
      FIRE ENGINEER-EMT
      FIRE
      F
      Salary
      NaN
      103350.0
      NaN
    
    
      10
      terry
      POLICE OFFICER
      POLICE
      F
      Salary
      NaN
      93354.0
      NaN
    
    
      12
      zaid
      POLICE OFFICER
      POLICE
      F
      Salary
      NaN
      84054.0
      NaN
    
    
      13
      abdalmahd
      POLICE OFFICER
      POLICE
      F
      Salary
      NaN
      87006.0
      NaN
    
    
      14
      aref
      FIREFIGHTER (PER ARBITRATORS AWARD)-PARAMEDIC
      FIRE
      F
      Salary
      NaN
      102228.0
      NaN

Gender prediction

To estimate the gender of an employee based on his or her first name, I used a data set of baby names. For each unique name, I counted how many times, from years 1940 to 2016, that name had been given to a boy versus a girl. If the name was more frequently given to boys, then I predicted the gender associated with the name to be male, and vice-versa for female.



In [7]:

    
# Data are in seperate CSV files per year, and are concatenated here
name_data = []
for yob in range(1940, 2017):
    df = pd.read_csv('names/yob' + str(yob) + '.txt', 
                     header=0, names=['Name', 'Gender', 'Count'])
    name_data.append(df)
names = pd.concat(name_data, axis=0)



In [8]:

    
# Lower-case first name so that it can be joined with the workers dataframe
names['Name'] = names['Name'].str.lower()



In [9]:

    
names.head(5)



In [10]:

    
# Count how often a name is given to boys and girls
gender_frequency = names.groupby(['Name', 'Gender']).sum().reset_index()



In [11]:

    
gender_frequency.sample(5)



In [12]:

    
def predict_gender(df):
    max_idx = df['Count'].idxmax()
    return df.loc[max_idx]



In [13]:

    
# Select the more frequent gender for each name
gender_guess = gender_frequency.groupby('Name').agg(predict_gender).reset_index()



In [14]:

    
gender_guess.sample(10)

The above list of names and associated genders can be combined with the worker data to predict the gender of each Chicago employee using a join:



In [15]:

    
workers = pd.merge(workers, gender_guess, on='Name', how='inner')



In [16]:

    
workers[['Name', 'Job Titles', 'Department', 'Gender', 'Annual Salary']].sample(10)









    Out[16]:







  
    
      
      Name
      Job Titles
      Department
      Gender
      Annual Salary
    
  
  
    
      5132
      alejandro
      SERGEANT
      POLICE
      M
      111474.0
    
    
      15833
      ronald
      FIREFIGHTER
      FIRE
      M
      90024.0
    
    
      8266
      timothy
      FIRE ENGINEER-PARAMEDIC
      FIRE
      M
      105804.0
    
    
      15113
      isabel
      HEAD LIBRARY CLERK
      PUBLIC LIBRARY
      F
      63876.0
    
    
      13747
      christophe
      POLICE OFFICER
      POLICE
      M
      90024.0
    
    
      5791
      eric
      SERGEANT
      POLICE
      M
      107988.0
    
    
      23372
      mayur
      CIVIL ENGINEER III
      TRANSPORTN
      M
      70092.0
    
    
      4141
      michael
      POLICE OFFICER
      POLICE
      M
      72510.0
    
    
      7175
      dwayne
      POLICE OFFICER
      POLICE
      M
      93354.0
    
    
      19643
      cleveland
      FIRE ENGINEER-EMT
      FIRE
      M
      110214.0

Analysis

I wanted to know wether men and women were paid equally if they shared the same job title and department. To answer this, I specifically looked at full-time, salaried employees, and jobs for which both men and women were employed under the same title and department.

For example, given the job title POLICE OFFICER in the POLICE department, a position for which both men and women are employed, do male and female officers have similar salaries? More generally, are men and women paid equally across all job titles and departments?



In [17]:

    
# Focus on these columns
workers = workers[['Job Titles', 'Department', 'Gender', 'Annual Salary']]



In [18]:

    
# Remove jobs for which only men or only women are employed
job_groups = workers[['Job Titles', 'Gender', 'Annual Salary']].groupby(['Job Titles'])
def male_and_female(grp):
    return np.any(grp['Gender']=='M') and np.any(grp['Gender']=='F')
job_groups = job_groups.filter(male_and_female)



In [19]:

    
# Look at the maximum salary of each gender for each job title
job_group_maximums = job_groups.groupby(['Job Titles', 'Gender']).agg(np.max)



In [20]:

    
job_group_maximums.head(30)









    Out[20]:







  
    
      
      
      Annual Salary
    
    
      Job Titles
      Gender
      
    
  
  
    
      A/SUPRV REDISTRICTING
      F
      44604.0
    
    
      M
      49236.0
    
    
      ACCIDENT ADJUSTER
      F
      53340.0
    
    
      M
      88344.0
    
    
      ACCOUNTANT I
      F
      76548.0
    
    
      M
      76548.0
    
    
      ACCOUNTANT II
      F
      84516.0
    
    
      M
      84516.0
    
    
      ACCOUNTANT III
      F
      92388.0
    
    
      M
      92388.0
    
    
      ACCOUNTANT IV
      F
      100776.0
    
    
      M
      100776.0
    
    
      ACCOUNTING TECHNICIAN I
      F
      70092.0
    
    
      M
      70092.0
    
    
      ACCOUNTING TECHNICIAN II
      F
      76932.0
    
    
      M
      76932.0
    
    
      ADMINISTRATIVE ASST
      F
      150000.0
    
    
      M
      165000.0
    
    
      ADMINISTRATIVE ASST II
      F
      70092.0
    
    
      M
      70092.0
    
    
      ADMINISTRATIVE ASST III
      F
      84420.0
    
    
      M
      84420.0
    
    
      ADMINISTRATIVE ASST TO DEPUTY CORPORATION COUNSEL
      F
      89076.0
    
    
      M
      62820.0
    
    
      ADMINISTRATIVE MANAGER
      F
      97740.0
    
    
      M
      112284.0
    
    
      ADMINISTRATIVE SECRETARY
      F
      74340.0
    
    
      M
      185004.0
    
    
      ADMINISTRATIVE SERVICES OFFICER I
      F
      92592.0
    
    
      M
      80568.0



In [21]:

    
higher_max_male_salary_count = 0
total_jobs = 0
for job_title, df in job_group_maximums.groupby(level=0):
    assert len(df) == 2
    if df.loc[(job_title, 'M')][0] > df.loc[(job_title, 'F')][0]:
        higher_max_male_salary_count += 1
    total_jobs += 1



In [22]:

    
higher_max_male_salary_percentage = 100 * higher_max_male_salary_count / total_jobs



In [23]:

    
higher_max_male_salary_percentage









    Out[23]:





29.748283752860413



In [25]:

    
ax = sns.stripplot(x="Gender", y="Annual Salary", data=job_group_maximums.reset_index(), jitter=True)



In [26]:

    
plt.show()



In [ ]:

	Name	Gender	Count
0	barbara	F	36732
1	patricia	F	32662
2	judith	F	22380
3	betty	F	22075
4	carol	F	21760

	Name	Gender	Count
45558	joslyn	F	9747
52124	kidada	F	93
88322	tailer	F	193
27454	earlina	F	6
90148	taytem	F	180

	Name	Gender	Count
60264	moayed	M	7
51581	leiasia	F	12
87844	xander	M	22482
67226	quantiana	F	5
24967	edroy	M	5
49449	lajae	F	18
53958	luvenia	F	1233
48929	kyran	M	2274
57741	matthan	M	247
43294	karam	M	722

	Name	Job Titles	Department	Full or Part-Time	Salary or Hourly	Typical Hours	Annual Salary	Hourly Rate
0	jeffery	SERGEANT	POLICE	F	Salary	NaN	101442.0	NaN
1	karina	POLICE OFFICER (ASSIGNED AS DETECTIVE)	POLICE	F	Salary	NaN	94122.0	NaN
2	kimberlei	CHIEF CONTRACT EXPEDITER	GENERAL SERVICES	F	Salary	NaN	101592.0	NaN
3	vicente	CIVIL ENGINEER IV	WATER MGMNT	F	Salary	NaN	110064.0	NaN
6	christopher	STAFF ASST TO THE ALDERMAN	CITY COUNCIL	F	Salary	NaN	50436.0	NaN
9	james	FIRE ENGINEER-EMT	FIRE	F	Salary	NaN	103350.0	NaN
10	terry	POLICE OFFICER	POLICE	F	Salary	NaN	93354.0	NaN
12	zaid	POLICE OFFICER	POLICE	F	Salary	NaN	84054.0	NaN
13	abdalmahd	POLICE OFFICER	POLICE	F	Salary	NaN	87006.0	NaN
14	aref	FIREFIGHTER (PER ARBITRATORS AWARD)-PARAMEDIC	FIRE	F	Salary	NaN	102228.0	NaN

	Name	Job Titles	Department	Gender	Annual Salary
5132	alejandro	SERGEANT	POLICE	M	111474.0
15833	ronald	FIREFIGHTER	FIRE	M	90024.0
8266	timothy	FIRE ENGINEER-PARAMEDIC	FIRE	M	105804.0
15113	isabel	HEAD LIBRARY CLERK	PUBLIC LIBRARY	F	63876.0
13747	christophe	POLICE OFFICER	POLICE	M	90024.0
5791	eric	SERGEANT	POLICE	M	107988.0
23372	mayur	CIVIL ENGINEER III	TRANSPORTN	M	70092.0
4141	michael	POLICE OFFICER	POLICE	M	72510.0
7175	dwayne	POLICE OFFICER	POLICE	M	93354.0
19643	cleveland	FIRE ENGINEER-EMT	FIRE	M	110214.0

		Annual Salary
Job Titles	Gender
A/SUPRV REDISTRICTING	F	44604.0
A/SUPRV REDISTRICTING	M	49236.0
ACCIDENT ADJUSTER	F	53340.0
ACCIDENT ADJUSTER	M	88344.0
ACCOUNTANT I	F	76548.0
ACCOUNTANT I	M	76548.0
ACCOUNTANT II	F	84516.0
ACCOUNTANT II	M	84516.0
ACCOUNTANT III	F	92388.0
ACCOUNTANT III	M	92388.0
ACCOUNTANT IV	F	100776.0
ACCOUNTANT IV	M	100776.0
ACCOUNTING TECHNICIAN I	F	70092.0
ACCOUNTING TECHNICIAN I	M	70092.0
ACCOUNTING TECHNICIAN II	F	76932.0
ACCOUNTING TECHNICIAN II	M	76932.0
ADMINISTRATIVE ASST	F	150000.0
ADMINISTRATIVE ASST	M	165000.0
ADMINISTRATIVE ASST II	F	70092.0
ADMINISTRATIVE ASST II	M	70092.0
ADMINISTRATIVE ASST III	F	84420.0
ADMINISTRATIVE ASST III	M	84420.0
ADMINISTRATIVE ASST TO DEPUTY CORPORATION COUNSEL	F	89076.0
ADMINISTRATIVE ASST TO DEPUTY CORPORATION COUNSEL	M	62820.0
ADMINISTRATIVE MANAGER	F	97740.0
ADMINISTRATIVE MANAGER	M	112284.0
ADMINISTRATIVE SECRETARY	F	74340.0
ADMINISTRATIVE SECRETARY	M	185004.0
ADMINISTRATIVE SERVICES OFFICER I	F	92592.0
ADMINISTRATIVE SERVICES OFFICER I	M	80568.0