Satya Gatiganti
Fall 2016
A ubiquitous issue that still exists in modern society is the gender wage gap. This project analyzes how wage gaps differ between countries around the world and considers what political and economic factors could contribute to a country's wage gap. It then takes a closer look at the gender wage gap in the United States: how it changes across professions and how it interacts with other demographic factors such as race. There are many possible reasons for the gender wage gap. This project focuses on a few interesting factors that could be correlated with it.
The first part of the analysis uses two data sources, https://data.oecd.org/earnwage/gender-wage-gap.htm and www.ilo.org/gwr-figures, to construct visual depictions of the gender wage gap around the world.
Women's Political and Economic Rights:
Human Development Index:
GDP Per Capita:
US Wage Gap Segmented by Profession:
US Wage Gap Segmented by Race:
In [4]:
%matplotlib inline
import sys
import pandas as pd
import pandas_datareader
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import pycountry
import seaborn as sns # seaborn.apionly was removed in later seaborn releases
from plotly.offline import iplot, iplot_mpl # plotting functions
import plotly.graph_objs as go # ditto
import plotly # just to print version and init notebook
import cufflinks as cf # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)
import plotly.plotly as py
# these lines make our graphics show up in the notebook
%matplotlib inline
plotly.offline.init_notebook_mode(connected=True)
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
The first data source (https://data.oecd.org/earnwage/gender-wage-gap.htm) reports the gender wage gap, in percentage terms, for 26 countries. The second (www.ilo.org/gwr-figures) covers many of the same countries but also includes developing countries that the first lacks. I will be using both of these data sources for my analysis.
In [5]:
url = "https://stats.oecd.org/sdmx-json/data/DP_LIVE/.WAGEGAP.../OECD?contentType=csv&detail=code&separator=comma&csv-lang=en"
wg = pd.read_csv(url)
wg
wg = wg[wg.SUBJECT!= 'SELFEMPLOYED'] #Only interested in total employment, and not just self employment
wg = wg[wg.TIME == 2010] #Only looking at the wage gaps for the most recent year: 2010
# Dropped all columns that I do not need
wg = wg.drop(['INDICATOR', 'MEASURE', 'FREQUENCY', 'Flag Codes', 'SUBJECT', 'TIME'], axis=1)
wg.columns = ["ISO", "Gender Wage Gap in % Difference"]
wg.head()
Out[5]:
In [6]:
file2 = 'data/GWG.xlsx'
gw = pd.read_excel(file2, encoding='latin-1')
gw = gw.drop('Explained wage gap', axis =1)
gw.columns = ['Country', 'Gender Wage Gap in % Difference']
gw
import csv
dic={}
# with open("wikipedia-iso-country-codes.csv") as f:
# file= csv.DictReader(f, delimiter=',')
# for line in file:
# dic[line['English short name lower case']]=line['Alpha-3 code']
# countries=gw['Country']
# [dic[x] for x in countries]
#Copied and pasted the values that were obtained from the line of code above
gw['ISO'] = ['USA',
'IRL',
'GBR',
'EST',
'ISL',
'CZE',
'CYP',
'NOR',
'AUT',
'NLD',
'DEU',
'GRC',
'SVK',
'BEL',
'FIN',
'BGR',
'FRA',
'ITA',
'ESP',
'LUX',
'DNK',
'LVA',
'ROU',
'PRT',
'HUN',
'POL',
'SVN',
'LTU',
'SWE',
'RUS',
'ARG',
'URY',
'BRA',
'CHL',
'CHN',
'PER',
'MEX',
'VNM',
'IND']
gw = gw[['ISO', 'Country', 'Gender Wage Gap in % Difference']]
gw = gw.drop('Country', axis =1)
gw.head()
Out[6]:
In [7]:
combination = pd.concat([wg, gw,], axis = 0)
combination = combination.drop_duplicates('ISO')
combination.head()
Out[7]:
In [8]:
layout = dict(geo={"scope": "world", "resolution": 110}, # plotly's geo resolution accepts 110 or 50
title = 'Gender Wage Gaps',
width=750, height=550)
In [9]:
trace = dict(type="choropleth",
locations=combination["ISO"], # use ISO names
z=combination["Gender Wage Gap in % Difference"], # defines the color,
colorscale=[[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
[0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
text=combination.index,
colorbar = dict(title = 'Gender Wage Gap in % Difference between Men and Women'),
)
iplot(go.Figure(data=[trace], layout=layout), link_text="")
Looking at the map, we can make several general observations about how gender wage gaps differ across the world. Europe, in general, seems to have lower wage gaps than the rest of the world. South America and Eastern Asia both show generally higher wage gaps, as the two regions largely consist of developing countries. These developing countries could place more limits on the advancement of women in the workplace, or may have only recently begun to encourage women's advancement. Korea has the highest gap on this map at 39%, while Sweden has the lowest at 4%.
I used the CIRI data source to retrieve scores for women's political and economic rights in each country in order to check their correlation with the country's gender wage gap. I extracted only the variables I need: the country, the year, and the scores for women's economic and political rights. Again, I needed to convert the country names into ISO codes, so I used the code below to do so. A score of 1.0 indicates limited rights while a score of 3.0 indicates full rights. I expect countries with lower economic and political rights to have higher wage gaps.
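The highest and lowest values read off the map can also be looked up directly. A minimal sketch, assuming a frame indexed by ISO code with the same wage gap column name used in this notebook; the codes and values below are hypothetical placeholders, not the notebook's data:

```python
import pandas as pd

# Hypothetical toy values standing in for the combined OECD/ILO frame
toy = pd.DataFrame({
    "ISO": ["AAA", "BBB", "CCC", "DDD"],
    "Gender Wage Gap in % Difference": [12.5, 30.1, 4.2, 18.0],
}).set_index("ISO")

gap = toy["Gender Wage Gap in % Difference"]
highest = gap.idxmax()  # ISO code with the largest gap
lowest = gap.idxmin()   # ISO code with the smallest gap
print(highest, gap[highest])
print(lowest, gap[lowest])
```

On the real data, the same two calls would name the countries at either end of the color scale.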
In [7]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO",
"Country or area name": "Country"})
iso = iso.drop("Numerical code", axis=1)
iso = iso.set_index("Country")
iso.head()
Out[7]:
In [8]:
#Women's Political and Economic Rights
url1 = 'https://drive.google.com/uc?export=download&id=0BxDpF6GQ-6fbbEdZYmRXekhGMFE'
hr = pd.read_csv(url1)
hr = hr[['CTRY', 'YEAR', 'WECON', 'WOPOL']]
hr.columns = ['Country', 'Year', 'Economic Rights', 'Political Rights']
hr = hr[hr.Year == 2011] #Only need data from the most recent year, which is 2011
hr.head()
Out[8]:
In [9]:
iso_hr = iso[iso.index.isin(hr['Country'])]
iso_hr.head()
Out[9]:
In [10]:
hr = hr.set_index('Country')
In [11]:
hr = hr.merge(iso, left_index=True, right_index=True)
hr.head()
Out[11]:
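The merge above joins on country names and keeps only exact matches, so a country spelled differently in the two sources drops out without warning. A minimal sketch of one way to list such names, using pandas' `indicator=True` option on hypothetical toy frames (the names and scores are illustrative only):

```python
import pandas as pd

# Hypothetical toy frames; the second spells one country differently
scores = pd.DataFrame({"Country": ["Sweden", "Korea, Republic of", "Peru"],
                       "Economic Rights": [3.0, 1.0, 1.0]}).set_index("Country")
codes = pd.DataFrame({"Country": ["Sweden", "Republic of Korea", "Peru"],
                      "ISO": ["SWE", "KOR", "PER"]}).set_index("Country")

# A left join keeps every row of `scores`; the _merge column records
# whether a matching name was found in `codes`
checked = scores.merge(codes, how="left", left_index=True,
                       right_index=True, indicator=True)
unmatched = checked[checked["_merge"] == "left_only"].index.tolist()
print(unmatched)
```

Any names printed here would need to be renamed by hand before the ISO codes can be attached.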
In [12]:
hr = hr.set_index('ISO')
In [14]:
combination = combination.set_index('ISO')
In [15]:
combination1 = combination.merge(hr, left_index=True, right_index=True)
combination1
Out[15]:
In [16]:
ax = sns.swarmplot(x="Economic Rights", y="Gender Wage Gap in % Difference", data=combination1)
fig_mpl = ax.get_figure()
I used a swarmplot for this data set because there are only 3 scores on the scale of women's economic rights: 1.0, 2.0 or 3.0. In general, at a score of 1.0 there is a larger cluster of countries with higher gender wage gaps, between 20% and 25%. Examples include Brazil, Peru, and China, whose gender wage gaps are 24%, 23%, and 22% respectively. Women's economic rights include equal pay for equal work, job security, and equality in hiring and promotion practices. This is consistent with the current economic environment of developing countries. The outliers include Poland, Hungary, and Colombia, all of which have relatively low wage gaps. It's important to keep in mind that a score of 1.0 includes economic rights under law, but the laws are not necessarily enforced strictly and vigilantly by the government. This could explain the combination of a low score and a low wage gap. In addition, the economic score encompasses factors such as the right to work in occupations classified as dangerous and the right to work at night, which are less commonly associated with the wage gap.
Less of a distinction can be made between a score of 2 and a score of 3. Overall, however, a trend can be noted in which women's economic rights appear to be associated with the wage gap.
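The visual impression from the swarmplot can be backed up with a per-score summary. A minimal sketch, assuming the same column names as `combination1`; the rows below are hypothetical, not the notebook's data:

```python
import pandas as pd

# Hypothetical rows shaped like combination1
toy = pd.DataFrame({
    "Economic Rights": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0],
    "Gender Wage Gap in % Difference": [24.0, 22.0, 8.0, 15.0, 17.0, 6.0],
})

# Number of countries, mean gap, and median gap for each rights score
summary = (toy.groupby("Economic Rights")["Gender Wage Gap in % Difference"]
              .agg(["count", "mean", "median"]))
print(summary)
```

Reporting the median next to the mean helps when a score group contains low-gap outliers, as described for the score of 1.0.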
In [17]:
fig, ax = plt.subplots()
ax.scatter(combination1['Political Rights'], combination1['Gender Wage Gap in % Difference'], # x, y variables
alpha=0.5) # marker transparency
ax.set_title('Gender Wage Gap vs Political Rights', loc='left', fontsize=14)
ax.set_xlabel('Political Rights')
ax.set_ylabel('Gender Wage Gap in % Difference')
Out[17]:
Political rights include a woman's right to vote, hold political office, and petition government officials. At a score of 3.0, there is a larger cluster of countries with lower gender wage gaps, between 5% and 10%. However, the relationship between the political rights score and the gender wage gap is less pronounced. Perhaps women's political rights in a country do not necessarily correlate with their economic rights?
In [19]:
np.corrcoef(hr['Economic Rights'], hr['Political Rights'])[0, 1]
Out[19]:
There appears to be a very high correlation coefficient between women's political and economic rights. Even so, women's political rights appear to be less indicative of the gender wage gap than their economic rights are.
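A correlation between two score columns is only meaningful if missing entries are excluded first. A minimal sketch, under the assumption that missing scores appear as negative sentinel codes; the values -999 and -77 below are illustrative, so the dataset's codebook should be checked for the actual codes:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; negative numbers stand in for assumed "missing" codes
toy = pd.DataFrame({
    "Economic Rights": [1.0, 2.0, 3.0, -999.0, 2.0, 1.0],
    "Political Rights": [2.0, 2.0, 3.0, 2.0, -77.0, 1.0],
})

# Turn negative codes into NaN, then drop rows where either score is missing
clean = toy.where(toy >= 0).dropna()

# Pearson correlation on the remaining rows
r = clean["Economic Rights"].corr(clean["Political Rights"])
print(len(clean), round(r, 3))
```

Leaving sentinel codes in place would let a handful of large negative numbers dominate the coefficient.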
In [130]:
# Gender Wage Gap in % Difference when Political Rights is 2.0
scoretwopolitical = [2.0]
political2 = combination1[combination1['Political Rights'].isin(scoretwopolitical)].drop('Economic Rights', axis =1).drop('Year', axis =1)
political2.head()
Out[130]:
In [131]:
politicalrights2 = political2['Gender Wage Gap in % Difference']
politicalrights2.mean()
Out[131]:
In [128]:
# Gender Wage Gap in % Difference when Political Rights is 3.0
scorethreepolitical = [3.0]
political3 = combination1[combination1['Political Rights'].isin(scorethreepolitical)].drop('Economic Rights', axis =1).drop('Year', axis =1)
political3.head()
Out[128]:
In [132]:
politicalrights3 = political3['Gender Wage Gap in % Difference']
politicalrights3.mean()
Out[132]:
The hypothesis that a lower political rights score would indicate a higher gender wage gap seems to be consistent with the calculated means. However, it is important to keep in mind that the sample size for countries with a score of 3 is rather small.
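With a small group, a difference in means can arise by chance. One way to gauge this is a one-sided permutation test, which is not part of the original analysis. A minimal sketch on hypothetical wage gaps for the two score groups:

```python
import numpy as np

# Hypothetical wage gaps (in %) for countries scoring 2.0 and 3.0
score2 = np.array([18.0, 22.0, 15.0, 20.0, 17.0, 21.0])
score3 = np.array([7.0, 12.0, 9.0])

observed = score2.mean() - score3.mean()

rng = np.random.default_rng(0)  # fixed seed for repeatability
pooled = np.concatenate([score2, score3])
n_perm = 10000
count = 0
for _ in range(n_perm):
    # Reassign countries to the two groups at random, keeping group sizes
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(score2)].mean() - shuffled[len(score2):].mean()
    if diff >= observed - 1e-12:  # small tolerance for float rounding
        count += 1

# Share of random splits with a difference at least as large as observed
p_value = count / n_perm
print(observed, p_value)
```

A small share suggests the observed difference is unlikely under random grouping, although with only three countries in one group the result remains fragile.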
For this data set, I will compare a country's Human Development Index (HDI) with its gender wage gap. The HDI is a composite statistic of life expectancy, education, and per capita income indicators, which is used to rank countries into four tiers of human development. I expect countries with a lower HDI to have higher wage gaps. As a first step, I convert the country names to their ISO codes.
In [20]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO",
"Country or area name": "Country"})
iso = iso.drop("Numerical code", axis=1)
iso = iso.set_index("Country")
iso.head()
Out[20]:
In [21]:
file2 = 'data/2015_Statistical_Annex_Table_1.xls'
hdi = pd.read_excel(file2)
hdi = hdi[['Country','Human Development Index']]
hdi.head()
Out[21]:
In [22]:
iso_hdi = iso[iso.index.isin(hdi['Country'])]
iso_hdi.head()
Out[22]:
In [23]:
hdi = hdi.set_index('Country')
In [24]:
hdi = hdi.merge(iso, left_index=True, right_index=True)
hdi.head()
Out[24]:
In [25]:
hdi = hdi.set_index('ISO')
In [26]:
combination2 = combination.merge(hdi, left_index=True, right_index=True)
combination2.head()
Out[26]:
In [27]:
fig, ax = plt.subplots()
ax.scatter(combination2['Human Development Index'], combination2['Gender Wage Gap in % Difference'], # x, y variables
alpha=0.5) # marker transparency
ax.set_title('Gender Wage Gap vs HDI', loc='left', fontsize=14)
ax.set_xlabel('HDI')
ax.set_ylabel('Gender Wage Gap in % Difference')
Out[27]:
In [29]:
np.corrcoef(combination2['Gender Wage Gap in % Difference'], combination2['Human Development Index'])[0, 1]
Out[29]:
Taking the scatter plot and the correlation coefficient (-0.29) together, there appears to be only a weak negative correlation between a country's gender wage gap and its HDI. Perhaps the HDI combines too many broad factors, each of which relates differently to the gender wage gap.
For this data set, I will compare the GDP per capita of a country with its gender wage gap. GDP per capita is an indicator of a country's economic activity as well as the purchasing power of its residents. It reflects living standards and how advanced a particular economy may be. I would expect countries with lower GDP per capita to have higher gender wage gaps. As a first step, I convert the country names to their ISO codes.
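For reference, the coefficient returned by `np.corrcoef` is the Pearson correlation, the sum of products of deviations divided by the square root of the product of the summed squared deviations. A minimal sketch that computes it by hand and compares it with NumPy, using hypothetical HDI and wage gap values:

```python
import numpy as np

# Hypothetical values, not the notebook's data
hdi = np.array([0.70, 0.75, 0.82, 0.88, 0.91, 0.94])
gap = np.array([22.0, 14.0, 19.0, 16.0, 17.0, 12.0])

# Deviations from the mean
dx = hdi - hdi.mean()
dy = gap - gap.mean()

# Pearson r from its definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(hdi, gap)[0, 1]
print(r_manual, r_numpy)
```

A negative value means the gap tends to shrink as HDI rises; values close to zero, such as the one reported above, indicate a weak linear relationship.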
In [44]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO",
"Country or area name": "Country"})
iso = iso.drop("Numerical code", axis=1)
iso = iso.set_index("Country")
iso.head()
Out[44]:
In [45]:
# World Bank Data
file2 = '/Users/satyagatiganti/Desktop/Data_Bootcamp/WDI_Data.csv'
wb = pd.read_csv(file2, encoding = 'latin-1')
wb = wb.rename(columns={'Indicator Name': 'Factor'})
vlist = ['GDP per capita (constant 2010 US$)']
wb = wb[wb['Factor'].isin(vlist)]
wb = wb[['Country Name', 'Country Code', 'Factor', '2015']]
wb.columns = ['Country', 'ISO', 'Factor', 'GDP Per Capita']
wb = wb.drop('Factor', axis=1)
wb.head(5)
Out[45]:
In [46]:
iso_wb = iso[iso.index.isin(wb['Country'])]
iso_wb.head()
Out[46]:
In [47]:
wb = wb.set_index('Country')
In [48]:
wb = wb.merge(iso, left_index=True, right_index =True)
wb.head()
Out[48]:
In [49]:
wb = wb.set_index('ISO_x')
In [51]:
combination3 = combination.merge(wb, left_index=True, right_index=True)
combination3.head()
Out[51]:
In [52]:
combination3 = combination3.sort_values('GDP Per Capita', ascending=True) # DataFrame.sort was removed from pandas
combination3.head()
Out[52]:
In [53]:
fig, ax = plt.subplots(figsize = (20,8))
combination3['Gender Wage Gap in % Difference'].plot(kind = 'bar', ax=ax)
ax.set_title('Gender Wage Gap by GDP Per Capita', loc='center', fontsize=14)
ax.set_xlabel('Countries in Ascending GDP Per Capita', fontsize = 12)
ax.set_ylabel('Gender Wage Gap in % Difference', fontsize = 12)
Out[53]:
The countries on the x-axis in the graph above are listed in ascending order of GDP per capita. Looking at the graph, there appears to be little correlation between GDP per capita and the gender wage gap. As with the HDI, perhaps GDP per capita is too broad a factor to use in evaluating the gender wage gap.
For this data set, I will compare the length of maternity leave, in weeks, in a country with its gender wage gap. I would expect countries with shorter maternity leaves to have higher gender wage gaps because this might indicate discrimination in the workplace. As a first step, I convert the country names to their ISO codes.
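Because the bar chart orders countries by GDP per capita, the question it poses is really about ranks. A rank correlation (Spearman's rho) puts a number on that; it is not computed in the original analysis. A minimal sketch that takes the Pearson correlation of the ranks, on hypothetical values:

```python
import pandas as pd

# Hypothetical values shaped like combination3
toy = pd.DataFrame({
    "GDP Per Capita": [1500.0, 6000.0, 12000.0, 45000.0, 90000.0],
    "Gender Wage Gap in % Difference": [20.0, 25.0, 10.0, 15.0, 5.0],
})

# Replace each value by its rank within its column
ranks = toy.rank()

# Spearman's rho is the Pearson correlation of the ranks
rho = ranks["GDP Per Capita"].corr(ranks["Gender Wage Gap in % Difference"])
print(rho)
```

Working on ranks also keeps a few very rich countries from dominating the result, which matters for a skewed variable like GDP per capita.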
In [127]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO",
"Country or area name": "Country"})
iso = iso.drop("Numerical code", axis=1)
iso = iso.set_index("Country")
iso.head()
Out[127]:
In [128]:
file3 = 'data/Maternity Leave Data.xlsx'
ml = pd.read_excel(file3)
ml = ml[['Country', '2015']]
ml.columns = ['Country', 'Maternity Leave in Weeks']
ml = ml.drop(0).drop(36)
ml.head()
Out[128]:
In [129]:
iso_ml = iso[iso.index.isin(ml['Country'])]
iso_ml.head()
Out[129]:
In [130]:
ml = ml.set_index('Country')
In [131]:
ml = ml.merge(iso, left_index=True, right_index =True)
ml.head()
Out[131]:
In [132]:
ml = ml.set_index('ISO')
In [133]:
combination4 = combination.merge(ml, left_index=True, right_index=True)
combination4.head()
Out[133]:
In [134]:
np.corrcoef(combination4['Maternity Leave in Weeks'], combination4['Gender Wage Gap in % Difference'])[0, 1]
Out[134]:
No figure was charted for this analysis because the correlation coefficient between maternity leave in weeks and the gender wage gap is minimal, indicating a weak correlation. Maternity leave policy appears to have little correlation with the wage gap, so it does not necessarily point to discrimination in this respect.
The current gender wage gap in the US stands at nearly 20%. I will be segmenting this percentage by profession and by race.
For the BLS data set, I will extract a variety of professions, each of which is already associated with its own gender wage gap. I want to compare how the wage gap differs by industry and consider the reasons for these differences.
In [10]:
#US earnings by occupation and sex
file4 = 'http://www.bls.gov/cps/cpsaat39.xlsx'
us = pd.read_excel(file4)
us = us.drop(0).drop(1).drop(2).drop(3).drop(5)
us = us.drop('Unnamed: 1', axis =1).drop('Unnamed: 2', axis=1).drop('Unnamed: 3', axis =1).drop('Unnamed: 5', axis=1)
us.columns = ['Occupation', 'Men median weekly earnings', 'Women median weekly earnings']
us.head()
Out[10]:
I extracted the men's and women's median weekly earnings for 12 different occupation groups, which are spread across many different industries.
In [11]:
occupationlist = ['Management occupations', 'Business and financial operations occupations', 'Architecture and engineering occupations', 'Community and social service occupations','Legal occupations', 'Arts, design, entertainment, sports, and media occupations', 'Healthcare practitioners and technical occupations', 'Food preparation and serving related occupations', 'Sales and related occupations', 'Office and administrative support occupations', 'Construction and extraction occupations', 'Education, training, and library occupations']
us = us[us['Occupation'].isin(occupationlist)]
us
Out[11]:
In [12]:
# cleaning up the data and making the names of the occupations shorter
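# map each full occupation name to a shorter label (set_value was removed from later pandas releases)
short_names = {'Management occupations': 'Management',
'Business and financial operations occupations': 'Business & financial ops',
'Architecture and engineering occupations': 'Architecture and Engineering',
'Community and social service occupations': 'Community & social service',
'Legal occupations': 'Legal',
'Education, training, and library occupations': 'Education',
'Arts, design, entertainment, sports, and media occupations': 'Arts, entertainment, sports',
'Healthcare practitioners and technical occupations': 'Healthcare',
'Food preparation and serving related occupations': 'Food Prep',
'Sales and related occupations': 'Sales',
'Office and administrative support occupations': 'Office & admin support',
'Construction and extraction occupations': 'Construction'}
us = us.assign(Occupation=us['Occupation'].replace(short_names))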
us
Out[12]:
In [13]:
us = us.set_index('Occupation')
In [14]:
men = dict(type="scatter",
name="Men",
mode="markers", # draw dots
x=us["Men median weekly earnings"], # x data
y=us.index, # y data
marker={"color": "Blue", "size": 12} # dot color/size
)
women = dict(type="scatter",
name="Women",
mode="markers",
x=us['Women median weekly earnings'],
y=us.index,
marker={"color": "Pink", "size": 12}
)
def draw_line(row):
    sc = row.name
    line = dict(type="scatter", # trace type
                x=[row["Women median weekly earnings"], row["Men median weekly earnings"]], # x data
                y=[sc, sc], # y data flat
                mode="lines", # draw line
                name=sc, # name trace
                showlegend=False, # no legend entry
                line={"color": "gray"} # line color
                )
    return line
lines = list(us.apply(draw_line, axis=1))
layout = dict(width=600, height=750, # plot width/height
yaxis={"title": "Occupation"}, # yaxis label
title="Gender earnings disparity by profession", # title
xaxis={"title": "Median Weekly Earnings"}) # xaxis label
# use + for two lists
data = [men, women] + lines
# build and display the figure
fig = go.Figure(data=data, layout=layout)
iplot(fig)
According to the graph above, the largest wage gap is in the legal occupations. This could be due to the practice areas women take up in law. For example, women are more likely to practice family or employment law, both of which offer lower salaries than the areas of law that men are more likely to take up. However, further research suggests that women lawyers are paid less regardless of how much they work. Some attribute the problem to differences in how women and men negotiate. Occupations with a small wage gap include construction and admin support. The nature of these industries is such that they offer less flexibility in salary negotiation.
In [18]:
file5 = 'http://www.bls.gov/cps/cpsaat37.xlsx'
demographics = pd.read_excel(file5)
demographics
demographics = demographics.drop(list(range(0, 17)) + [19, 20, 23, 24, 27, 28, 31, 32]) # drop header and unused rows
demographics = demographics.drop(['Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3'], axis=1)
demographics.columns = ['Demographics', 'Median weekly earnings']
demographics = demographics.drop('Demographics', axis=1)
vlist4 = [920.0, 680.0, 1129.0, 631.0]
demographics = demographics[demographics['Median weekly earnings'].isin(vlist4)]
demographics['Race'] = ['White', 'Black or African American', 'Asian', 'Hispanic or Latino']
demographics['Women Median Weekly Earnings'] = [743.0, 615.0, 877.0, 556.0]
demographics['Men Median Weekly Earnings']= [920.0, 680.0, 1129.0, 631.0]
demographics = demographics.drop('Median weekly earnings', axis=1)
demographics = demographics.set_index('Race')
demographics
Out[18]:
In [19]:
Women = dict(type="bar", # trace type
orientation="h", # make bars horizontal
name="Women", # legend entry
x=demographics["Women Median Weekly Earnings"], # x data
y=demographics.index, # y data
marker={"color": "Pink"} # pink bars
)
Men = dict(type="bar", # trace type
orientation="h", # horizontal bars
name="Men", # legend entry
x=demographics["Men Median Weekly Earnings"], # x data
y=demographics.index, # y data
marker={"color": "Blue"} # blue bars
)
layout = dict(width=650, height=750, # plot width/height
yaxis={"title": "Race"}, # yaxis label
title="Earnings by Gender and Race", # title
xaxis={"title": "Median Weekly Earnings"} # xaxis label
)
iplot(go.Figure(data=[Men, Women], layout=layout))
According to the bar graph above, the largest gender wage gap among races is observed between Asian women and Asian men. Studies have shown that this wage gap can be attributed to discrimination in the workplace, especially against Asian women. Other contributing factors, according to Pew Research, are risk aversion and negotiation skills, which Asian workers are said to draw on less heavily than workers of other races. Interestingly, negotiation plays a significant role in explaining the wage gap, whether segmented by occupation or by race. Another theory is that women do not necessarily seek the same high-paying STEM fields as men do because of experiences rooted in prejudice, such as greater encouragement for men to pursue these professions. An important point to note is that both Asian men and women make more money than their Black, Hispanic, or White counterparts. However, as Asian women climb the career ladder and thereby earn higher salaries, there is far more discrimination present at the executive levels. This could account for the larger wage gap among Asians.
Of the factors I compared against gender wage gaps around the world, only women's economic rights in a country seem to be correlated with the gender wage gap. This indicates that the gender wage gap is a highly complex issue with no single, consistent cause. The gender wage gap is attributable to an individual country's culture, government, and economic opportunities.
Segmenting the wage gap in the United States allows us to observe where women are making headway and where there is still progress to be made. Professions such as food prep, construction, and admin support show smaller wage gaps, whereas legal and sales show larger ones. This points to the skill level, work culture, and long-lasting prejudices embedded within these professions. In terms of race, while women on average are earning more than they did in the past, high barriers still exist at more senior and executive levels.
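The comparison across races can be made explicit by turning the weekly earnings into a percentage gap. A minimal sketch using the earnings figures entered in the cell above, and assuming the gap is measured relative to men's median earnings (one common convention; the column name `Gap in %` is introduced here for illustration):

```python
import pandas as pd

# Median weekly earnings as entered in the notebook cell above
demo = pd.DataFrame({
    "Women Median Weekly Earnings": [743.0, 615.0, 877.0, 556.0],
    "Men Median Weekly Earnings": [920.0, 680.0, 1129.0, 631.0],
}, index=["White", "Black or African American", "Asian", "Hispanic or Latino"])

men = demo["Men Median Weekly Earnings"]
women = demo["Women Median Weekly Earnings"]

# Gap as a share of men's earnings, in percent (assumed definition)
demo["Gap in %"] = (men - women) / men * 100
print(demo["Gap in %"].round(1).sort_values(ascending=False))
```

Under this definition the gap is largest for Asian workers, at roughly 22%, which matches the reading of the bar graph.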
Overall, this analysis demonstrates that women are certainly making headway in closing the gender wage gap but there is still room for improvement.