Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning black-sounding or white-sounding names to otherwise identical résumés and observing the impact on requests for interviews from employers.
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating a black-sounding or white-sounding name. The 'call' column has two values, 1 and 0, indicating whether or not the resume received a callback from an employer.
Note that the 'b' and 'w' values in race are assigned randomly to the resumes.
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.
Answer the following questions in the notebook cells below and submit the notebook to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
Out[2]:
In [20]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')
data.head()
Out[20]:
In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)
Out[4]:
In [5]:
#number of callbacks for white-sounding names
sum(data[data.race=='w'].call)
Out[5]:
In [6]:
black = data[data.race=='b']
white = data[data.race=='w']
In [18]:
len(black)
Out[18]:
In [19]:
len(white)
Out[19]:
In [8]:
black_called = len(black[black.call == 1])
black_notcalled = len(black[black.call == 0])
white_called = len(white[white.call == 1])
white_notcalled = len(white[white.call == 0])
In [16]:
# probability of a white-sounding name getting a callback
prob_white_called = white_called/len(white)
prob_white_called
Out[16]:
In [17]:
prob_black_called = black_called/len(black)
prob_black_called
Out[17]:
In [35]:
# probability of a resume getting a callback
prob_called = sum(data.call)/len(data)
prob_called
Out[35]:
In [15]:
results = pd.DataFrame({'black':{'called':black_called,'not_called':black_notcalled},
'white':{'called':white_called,'not_called':white_notcalled}})
results
Out[15]:
The CLT applies (in its most common form) to a sufficiently large set of independent, identically distributed random variables. Each of these data points is independent and drawn from the same probability distribution (we assume). If the resumes were sent out to a representative collection of potential employers, this shouldn't be an issue.
We also see from our results table that for both categories $np > 10$ and $n(1-p) > 10$.
We note, however, that unlike the previous project there are two binary states for two types of data points, making four categories in total. An appropriate statistical test is therefore Pearson's $\chi^{2}$ test, a statistical test that applies to sets of categorical data such as we have here. The CLT tells us that for large sample sizes the distribution will tend toward a multivariate normal distribution.
The alternative is to compare the callback rates of the two groups directly, which is a two-tailed z-test of the difference of two proportions. Both of these methods were covered in the Khan Academy links provided by Springboard.
The null hypothesis is that the race a name sounds like has no predictive effect on the rate of callbacks. The alternative hypothesis is that there is a statistically significant difference between the two groups. Since the number of black-sounding and white-sounding names was the same, and the names were randomly assigned to identical resumes (thereby removing the potential real-world biases of different education, experience levels, or other advantages/disadvantages), we expect the same number of total callbacks in each group under the null hypothesis. Clearly this is not the case: white-sounding names have a $9.65\%$ callback rate, in contrast to $6.45\%$ for black-sounding names.
In [48]:
#get expected proportions of each group
total_called = sum(data.call)
total_notcalled = len(data) - sum(data.call)
total_called/2
Out[48]:
In [49]:
# Use a chi-square test with 1 degree of freedom, since (#cols - 1) * (#rows - 1) = 1.
# scipy's default is k - 1 = 3 degrees of freedom for 4 categories, so ddof=2 brings it down to 1.
result_freq = [black_called, white_called, black_notcalled, white_notcalled]
expected_freq = [total_called/2,total_called/2,total_notcalled/2,total_notcalled/2]
stats.chisquare(f_obs=result_freq, f_exp = expected_freq, ddof=2)
Out[49]:
We obtain $\chi^2 = 16.9$ with a p-value of $3.98\times10^{-5}$. This is highly significant, well below standard thresholds. We conclude that the perceived race of a name very likely plays a role in the rate of callbacks.
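As a cross-check, `scipy.stats.chi2_contingency` builds the expected frequencies from the table margins automatically instead of requiring them by hand. A minimal sketch, using a hypothetical reconstruction of the counts implied by the rates quoted above (assuming 2435 resumes per group, so 235 white-sounding and 157 black-sounding callbacks) rather than the loaded dataset:

```python
import scipy.stats as stats

# Hypothetical reconstruction: 2435 resumes per group, with callback counts
# implied by the 9.65% / 6.45% rates reported above (not read from the file).
table = [[157, 2435 - 157],   # black-sounding: called, not called
         [235, 2435 - 235]]   # white-sounding: called, not called

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2, p, dof)
```

With `correction=False` (no Yates continuity correction) this reproduces the $\chi^2 = 16.9$ statistic above; the default correction would give a slightly smaller value.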
In [67]:
#calculate standard error, using pooled data.
stderr = np.sqrt(prob_called*(1-prob_called)*(1/len(black)+1/len(white)))
print(stderr)
#get z score
z_score = (prob_white_called - prob_black_called) / stderr
z_score
Out[67]:
In [63]:
# We can also compute a p-value for the difference in proportions directly,
# apart from the one obtained from the chi-squared test. The two are equivalent
# (z**2 equals chi-squared with 1 degree of freedom) and give the same result.
# Use norm.sf since z is positive (more accurate than 1 - norm.cdf).
p_value2 = stats.norm.sf(z_score)*2
p_value2
Out[63]:
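The pooled z-test above can be packaged as a small reusable helper. A minimal sketch, again run on the hypothetical counts implied by the reported rates (235/2435 vs 157/2435) rather than the loaded frame:

```python
import numpy as np
from scipy.stats import norm

def two_prop_ztest(called_a, n_a, called_b, n_b):
    """Two-tailed z-test for a difference of two proportions, pooled standard error."""
    p_a, p_b = called_a / n_a, called_b / n_b
    pooled = (called_a + called_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2 * norm.sf(abs(z))

z, p = two_prop_ztest(235, 2435, 157, 2435)
print(z, p)
```

For a 2x2 table with one degree of freedom, $z^2 = \chi^2$, which is why this p-value agrees with the chi-squared one.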
In [66]:
# A 95 percent two-tailed confidence interval corresponds to the critical value
z_critical = stats.norm.ppf(.975)
z_critical
Out[66]:
In [70]:
std_err_unpooled = np.sqrt(prob_black_called*(1-prob_black_called)/len(black) +
                           prob_white_called*(1-prob_white_called)/len(white))
conf_interval = [prob_white_called - prob_black_called - z_critical*std_err_unpooled,
                 prob_white_called - prob_black_called + z_critical*std_err_unpooled]
conf_interval
Out[70]:
We are thus 95% confident that the true difference in callback rates between black-sounding and white-sounding names lies within this range.
For our p-value test, we found that we would expect a result this extreme from random chance less than 4 times in 100,000. Our confidence is thus very high that the observed effect is not due to random chance.
For our confidence interval, we calculated the range by setting the error rate at 5%: in less than 5% of samples would random chance produce a difference in proportions outside this interval.
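A resampling check on that interval: bootstrap the difference in callback rates from reconstructed 0/1 call arrays (hypothetical, built from the counts the reported rates imply) and take the 2.5th and 97.5th percentiles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction of the call columns from the reported rates.
white_calls = np.concatenate([np.ones(235), np.zeros(2435 - 235)])
black_calls = np.concatenate([np.ones(157), np.zeros(2435 - 157)])

# Resample each group with replacement and record the difference in rates.
diffs = [rng.choice(white_calls, white_calls.size).mean()
         - rng.choice(black_calls, black_calls.size).mean()
         for _ in range(5000)]

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo, hi)
```

This nonparametric interval should closely match the normal-approximation interval above, and like it should exclude zero.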
In [23]:
len(white[white['sex']=='f'])
Out[23]:
In [24]:
len(white[white['sex']=='m'])
Out[24]:
In [26]:
len(black[black['sex']=='f'])
Out[26]:
In [25]:
len(black[black['sex']=='m'])
Out[25]:
In [32]:
print(sum(white[white['sex']=='f'].call)/len(white[white['sex']=='f']))
print(sum(white[white['sex']=='m'].call)/len(white[white['sex']=='m']))
In [33]:
print(sum(black[black['sex']=='f'].call)/len(black[black['sex']=='f']))
print(sum(black[black['sex']=='m'].call)/len(black[black['sex']=='m']))
We do note, however, that the split by sex is not even. While female resumes have a higher callback rate than male resumes in both groups, it is the black-sounding resumes that had more feminine names, so this should work to shrink the difference between the groups.
It is possible that some of the difference in callback rates is attributable to the sex of a name as well as its perceived race. Using the pool of names, one should check whether the white-name database housed more gender-neutral names. Some of the statistical significance of the difference between the two groups might then be attributed to a different form of bias on the part of potential employers. Looking through the paper referenced in the link given at the start of the notebook, however, we see that the first names do not seem gender-neutral. The possible exception is 'Brett' as a white masculine name, which can also be used as a feminine name but has fallen out of favour in modern times.
So, by doing a bit of separate legwork, we are confident that the statistical significance of our result is not undermined by the sex split.
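The eight single-value cells above can be condensed into one view with a pivot table. A minimal sketch on a small hypothetical stand-in frame (the real notebook would pass `data` itself):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `data` frame.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'race': rng.choice(['b', 'w'], 400),
    'sex':  rng.choice(['f', 'm'], 400),
    'call': rng.choice([0.0, 1.0], 400, p=[0.92, 0.08]),
})

# Callback rate and group size for every race/sex combination at once.
rates = demo.pivot_table(index='sex', columns='race', values='call', aggfunc='mean')
counts = pd.crosstab(demo['sex'], demo['race'])
print(rates)
print(counts)
```

Since 'call' is stored as 0/1, the mean of each cell is exactly the callback rate for that race/sex combination.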