Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.
Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.
Answer the following questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [43]:
import pandas as pd
import numpy as np
import math
from scipy import stats
import matplotlib.pyplot as plt
In [39]:
data = pd.read_stata('data/us_job_market_discrimination.dta')
data.head(5)
Out[39]:
(1) The number of callbacks in each group (black-sounding and white-sounding names) follows a binomial distribution, which is well approximated by a normal distribution when np > 10 and n(1-p) > 10 (see below). Under these conditions the CLT applies: each sample callback rate is approximately normally distributed, and so is the difference in callback rates between the two groups.
In [40]:
n_b = data[data.race=='b'].call.count() #total number of black-sounding names
p_b = np.sum(data[data.race=='b'].call)/n_b #callback rate for black-sounding names
n_w = data[data.race=='w'].call.count() #total number of white-sounding names
p_w = np.sum(data[data.race=='w'].call)/n_w #callback rate for white-sounding names
print('np for black-sounding names: ' + str(n_b * p_b))
print('n(1-p) for black-sounding names: ' + str(n_b * (1-p_b)))
print('np for white-sounding names: ' + str(n_w * p_w))
print('n(1-p) for white-sounding names: ' + str(n_w * (1-p_w)))
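The normal approximation can also be checked by simulation. The sketch below draws many pairs of binomial samples, forms the difference in sample proportions, and compares its spread with the analytic standard error. The sample sizes and callback rates are hypothetical placeholders, not values taken from the dataset, and the function names are illustrative.

```python
import numpy as np

def simulate_diffs(n1, p1, n2, p2, n_sims=20000, seed=0):
    """Simulate the difference between two sample proportions."""
    rng = np.random.default_rng(seed)
    prop1 = rng.binomial(n1, p1, size=n_sims) / n1
    prop2 = rng.binomial(n2, p2, size=n_sims) / n2
    return prop1 - prop2

def analytic_se(n1, p1, n2, p2):
    """Standard error of the difference of two independent proportions."""
    return np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical inputs, chosen only so that np > 10 and n(1-p) > 10 hold
n1, p1, n2, p2 = 2000, 0.06, 2000, 0.10

diffs = simulate_diffs(n1, p1, n2, p2)
se_demo = analytic_se(n1, p1, n2, p2)
sim_sd = diffs.std()

# If the normal approximation is good, about 95% of simulated
# differences fall within 1.96 standard errors of the true difference.
coverage = np.mean(np.abs(diffs - (p1 - p2)) < 1.96 * se_demo)

print('simulated SD:', sim_sd)
print('analytic SE: ', se_demo)
print('share within 1.96 SE:', coverage)
```

If the simulated spread and the analytic standard error agree closely, and the coverage is near 0.95, the normal approximation is a reasonable basis for the z-test that follows.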
(2) A two-sample z-test is appropriate for testing whether the callback rate $p$ differs significantly between resumes with black-sounding and white-sounding names:
$H_0: p_B = p_W$
$H_A: p_B \neq p_W$
In this case, the p-value is 3.86e-05, the margin of error is 0.015, and the 95% confidence interval for $p_B - p_W$ is [-0.047, -0.017]. The interval excludes zero and the p-value is far below 0.05, so we reject the null hypothesis that resumes with black-sounding and white-sounding names have the same callback rate.
In [66]:
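mean_diff = p_b - p_w  # observed difference in callback rates
# standard error of the difference (unpooled variances)
se = math.sqrt((p_b * (1 - p_b)) / n_b + (p_w * (1 - p_w)) / n_w)
z = abs(mean_diff) / se  # z statistic
p_z = 2 * stats.norm.sf(z)  # two-sided p-value
me = 1.96 * se  # margin of error at the 95% level
ub = mean_diff + me  # upper bound of the confidence interval
lb = mean_diff - me  # lower bound of the confidence interval
print('p value is: ' + str(p_z))
print('margin of error is: ' + str(me))
print('95% Confidence Interval: [' + str(lb) + ', ' + str(ub) + ']')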
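The cell above uses the unpooled standard error for both the test and the confidence interval. A common textbook variant computes the test statistic with a pooled standard error, since under $H_0$ both groups share one callback rate. The sketch below shows that variant as a cross-check, not as the method used in this notebook. The function name and the counts passed to it are hypothetical.

```python
import math

def pooled_two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equal proportions using the pooled standard error.

    x1, x2 are success counts; n1, n2 are sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # common rate assumed under H0
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pool
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided p-value;
    # it matches 2 * scipy.stats.norm.sf(abs(z)) without needing SciPy.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not the dataset's values)
z_demo, p_demo = pooled_two_proportion_ztest(60, 1000, 100, 1000)
print('z:', z_demo)
print('p value:', p_demo)
```

With samples as large as those in this study, the pooled and unpooled versions usually give very similar z statistics, so the conclusion should not depend on which one is used.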
(3) In summary, there is a significant difference in callback rate between resumes with black-sounding and white-sounding names: the callback rate for resumes with black-sounding names is significantly lower than that for resumes with white-sounding names. However, this does not mean that race/name is the most important factor in the callback rate. Because names were assigned to resumes at random, other resume characteristics (such as years of experience and education) should be balanced across the two groups, but we can check whether any of them act as confounding variables. If not, we can further evaluate whether any other variables have a larger effect on callbacks than race/name.
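One way to start checking for confounding is to compare covariate means across the two name groups. The sketch below is a minimal balance check on a toy DataFrame. The column names `yearsexp` and `education` are assumptions about how the dataset labels these variables, and the toy values are made up for illustration.

```python
import pandas as pd

def covariate_balance(df, group_col, covariates, groups=('b', 'w')):
    """Return the mean of each covariate per group, plus the difference."""
    means = df.groupby(group_col)[covariates].mean().T
    means['diff'] = means[groups[0]] - means[groups[1]]
    return means

# Toy data for illustration only; column names are assumed, not confirmed
toy = pd.DataFrame({
    'race':      ['b', 'w', 'b', 'w', 'b', 'w'],
    'yearsexp':  [5, 6, 7, 8, 9, 10],
    'education': [4, 4, 3, 3, 4, 4],
})

balance = covariate_balance(toy, 'race', ['yearsexp', 'education'])
print(balance)
```

On the real data, large values in the `diff` column would suggest that a covariate is unevenly distributed between the groups and deserves a closer look. Under random assignment the differences are expected to be small.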