Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning black-sounding or white-sounding names to otherwise identical résumés and observing the impact on requests for interviews from employers.
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating a black-sounding or white-sounding name. The 'call' column has two values, 1 and 0, indicating whether or not the resume received a callback from an employer.
Note that the 'b' and 'w' values in race are assigned randomly to the resumes.
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.
Answer the following questions in the notebook cells below and submit the notebook to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
Out[2]:
In [20]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')
data.head()
Out[20]:
In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)
Out[4]:
In [5]:
#number of callbacks for white-sounding names
sum(data[data.race=='w'].call)
Out[5]:
In [6]:
black = data[data.race=='b']
white = data[data.race=='w']
In [18]:
len(black)
Out[18]:
In [19]:
len(white)
Out[19]:
In [8]:
black_called = len(black[black.call == 1])
black_notcalled = len(black[black.call == 0])
white_called = len(white[white.call == 1])
white_notcalled = len(white[white.call == 0])
In [16]:
# probability of a white-sounding name getting a callback
prob_white_called = white_called/len(white)
prob_white_called
Out[16]:
In [17]:
prob_black_called = black_called/len(black)
prob_black_called
Out[17]:
In [35]:
# probability of a resume getting a callback
prob_called = sum(data.call)/len(data)
prob_called
Out[35]:
In [15]:
results = pd.DataFrame({'black':{'called':black_called,'not_called':black_notcalled},
'white':{'called':white_called,'not_called':white_notcalled}})
results
Out[15]:
The CLT applies (in its most common form) to a sufficiently large set of independent, identically distributed random variables. Each of these data points is independent and drawn from the same probability distribution (we assume). If the resumes were sent out to a representative collection of potential employers, this shouldn't be an issue.
We also see from our results table that for both categories $np > 10$ and $n(1-p) > 10$.
We note, however, that unlike the previous project there are two binary states for two types of data points, making four categories in total. An appropriate statistical test is therefore Pearson's $\chi^{2}$ test, a statistical test that applies to sets of categorical data such as we have here. The CLT tells us that for large sample sizes the distribution will tend toward a multivariate normal distribution.
The alternative is to compare the callback rates of the two groups directly, which is a two-tailed z-test of the difference of two proportions. Both of these methods were covered in the Khan Academy links provided by Springboard.
The null hypothesis is that the race a name sounds like has no predictive effect on the rate of callbacks. The alternative hypothesis is that there is a statistically significant difference between the two groups. Since the number of black-sounding and white-sounding names was the same, and the names were randomly assigned to identical resumes (thereby removing the potential real-world biases of different education, experience levels, or other advantages/disadvantages), we expect the same number of total callbacks in each group under the null hypothesis. Clearly this is not the case: white-sounding names have a $9.65\%$ callback rate, in contrast to $6.45\%$ for black-sounding names.
In [48]:
#get expected proportions of each group
total_called = sum(data.call)
total_notcalled = len(data) - sum(data.call)
total_called/2
Out[48]:
In [49]:
# Use a chi-square test with 1 degree of freedom, since (#cols - 1) * (#rows - 1) = 1.
# scipy's default is k - 1 = 3 degrees of freedom for 4 categories, so ddof=2 brings it down to 1.
result_freq = [black_called, white_called, black_notcalled, white_notcalled]
expected_freq = [total_called/2,total_called/2,total_notcalled/2,total_notcalled/2]
stats.chisquare(f_obs=result_freq, f_exp = expected_freq, ddof=2)
Out[49]:
We obtain $\chi^2 = 16.9$ with a p-value of $3.98\times10^{-5}$. This is highly significant, well below standard thresholds. We conclude that the perceived race of a name very likely plays a role in the rate of callbacks.
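As a cross-check, `scipy.stats.chi2_contingency` builds the expected frequencies from the table margins automatically instead of requiring them by hand. A minimal sketch, using a hypothetical reconstruction of the counts implied by the rates quoted above (assuming 2435 resumes per group, so 235 white-sounding and 157 black-sounding callbacks) rather than the loaded dataset:

```python
import scipy.stats as stats

# Hypothetical reconstruction: 2435 resumes per group, with callback counts
# implied by the 9.65% / 6.45% rates reported above (not read from the file).
table = [[157, 2435 - 157],   # black-sounding: called, not called
         [235, 2435 - 235]]   # white-sounding: called, not called

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2, p, dof)
```

With `correction=False` (no Yates continuity correction) this reproduces the $\chi^2 = 16.9$ statistic above; the default correction would give a slightly smaller value.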
In [67]:
#calculate standard error, using pooled data.
stderr = np.sqrt(prob_called*(1-prob_called)*(1/len(black)+1/len(white)))
print(stderr)
#get z score
z_score = (prob_white_called - prob_black_called) / stderr
z_score
Out[67]:
In [63]:
# We can also compute a p-value for the difference in proportions directly,
# apart from the one obtained from the chi-squared test. The two are equivalent
# (z**2 equals chi-squared with 1 degree of freedom) and give the same result.
# Use norm.sf since z is positive (more accurate than 1 - norm.cdf).
p_value2 = stats.norm.sf(z_score)*2
p_value2
Out[63]:
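The pooled z-test above can be packaged as a small reusable helper. A minimal sketch, again run on the hypothetical counts implied by the reported rates (235/2435 vs 157/2435) rather than the loaded frame:

```python
import numpy as np
from scipy.stats import norm

def two_prop_ztest(called_a, n_a, called_b, n_b):
    """Two-tailed z-test for a difference of two proportions, pooled standard error."""
    p_a, p_b = called_a / n_a, called_b / n_b
    pooled = (called_a + called_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2 * norm.sf(abs(z))

z, p = two_prop_ztest(235, 2435, 157, 2435)
print(z, p)
```

For a 2x2 table with one degree of freedom, $z^2 = \chi^2$, which is why this p-value agrees with the chi-squared one.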
In [66]:
# A 95 percent two-tailed confidence interval corresponds to the critical value
z_critical = stats.norm.ppf(.975)
z_critical
Out[66]:
In [70]:
std_err_unpooled = np.sqrt(prob_black_called*(1-prob_black_called)/len(black) +
                           prob_white_called*(1-prob_white_called)/len(white))
conf_interval = [prob_white_called - prob_black_called - z_critical*std_err_unpooled,
                 prob_white_called - prob_black_called + z_critical*std_err_unpooled]
conf_interval
Out[70]:
We are thus 95% confident that the true difference in callback rates between black-sounding and white-sounding names lies within this range.
For our p-value test, we found that we would expect a result this extreme from random chance less than 4 times in 100,000. Our confidence is thus very high that the observed effect is not due to random chance.
For our confidence interval, we calculated the range by setting the error rate at 5%: in less than 5% of samples would random chance produce a difference in proportions outside this interval.
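A resampling check on that interval: bootstrap the difference in callback rates from reconstructed 0/1 call arrays (hypothetical, built from the counts the reported rates imply) and take the 2.5th and 97.5th percentiles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction of the call columns from the reported rates.
white_calls = np.concatenate([np.ones(235), np.zeros(2435 - 235)])
black_calls = np.concatenate([np.ones(157), np.zeros(2435 - 157)])

# Resample each group with replacement and record the difference in rates.
diffs = [rng.choice(white_calls, white_calls.size).mean()
         - rng.choice(black_calls, black_calls.size).mean()
         for _ in range(5000)]

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo, hi)
```

This nonparametric interval should closely match the normal-approximation interval above, and like it should exclude zero.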
In [23]:
len(white[white['sex']=='f'])
Out[23]:
In [24]:
len(white[white['sex']=='m'])
Out[24]:
In [26]:
len(black[black['sex']=='f'])
Out[26]:
In [25]:
len(black[black['sex']=='m'])
Out[25]:
In [32]:
print(sum(white[white['sex']=='f'].call)/len(white[white['sex']=='f']))
print(sum(white[white['sex']=='m'].call)/len(white[white['sex']=='m']))
In [33]:
print(sum(black[black['sex']=='f'].call)/len(black[black['sex']=='f']))
print(sum(black[black['sex']=='m'].call)/len(black[black['sex']=='m']))
We do note, however, that the split by sex is not even. While female resumes have a higher callback rate than male resumes in both groups, it is the black-sounding resumes that had more feminine names, so this should work to shrink the difference between the groups.
It is possible that some of the difference in callback rates is attributable to the sex of a name as well as its perceived race. Using the pool of names, one should check whether the white-name database housed more gender-neutral names. Some of the statistical significance of the difference between the two groups might then be attributed to a different form of bias on the part of potential employers. Looking through the paper referenced in the link given at the start of the notebook, however, we see that the first names do not seem gender-neutral. The possible exception is 'Brett' as a white masculine name, which can also be used as a feminine name but has fallen out of favour in modern times.
So, by doing a bit of separate legwork, we are confident that the statistical significance of our result is not undermined by the sex split.
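The eight single-value cells above can be condensed into one view with a pivot table. A minimal sketch on a small hypothetical stand-in frame (the real notebook would pass `data` itself):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `data` frame.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'race': rng.choice(['b', 'w'], 400),
    'sex':  rng.choice(['f', 'm'], 400),
    'call': rng.choice([0.0, 1.0], 400, p=[0.92, 0.08]),
})

# Callback rate and group size for every race/sex combination at once.
rates = demo.pivot_table(index='sex', columns='race', values='call', aggfunc='mean')
counts = pd.crosstab(demo['sex'], demo['race'])
print(rates)
print(counts)
```

Since 'call' is stored as 0/1, the mean of each cell is exactly the callback rate for that race/sex combination.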