# Examining Racial Discrimination in the US Job Market

### Background

Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data

In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises

You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions in this notebook below and submit to your Github account.

1. What test is appropriate for this problem? Does CLT apply?
2. What are the null and alternate hypotheses?
3. Compute margin of error, confidence interval, and p-value.
4. Write a story describing the statistical significance in the context or the original problem.
5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown:

#### Import the libs

``````

In :

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

``````

``````

In :

``````
``````

In :

``````
``````

Out:

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

id
education
ofjobs
yearsexp
honors
volunteer
military
empholes
occupspecific
...
compreq
orgreq
manuf
transcom
bankreal
busservice
othservice
missind
ownership

0
b
1
4
2
6
0
0
0
1
17
...
1.0
0.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0

1
b
1
3
3
6
0
1
1
0
316
...
1.0
0.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0

2
b
1
4
1
6
0
0
0
0
19
...
1.0
0.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0

3
b
1
3
4
6
0
1
0
1
313
...
1.0
0.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0

4
b
1
3
3
22
0
0
0
0
313
...
1.0
1.0
0.0
0.0
0.0
0.0
0.0
1.0
0.0
Nonprofit

5 rows × 65 columns

``````

### Dataset with résumés that received callbacks

``````

In :

data_calls = data[['id','race','call', 'education', 'yearsexp']].loc[data['call']==1]

``````
``````

Out:

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

id
race
call
education
yearsexp

85
b
w
1.0
2
7

95
b
w
1.0
2
4

105
b
w
1.0
4
6

107
b
b
1.0
4
6

126
b
b
1.0
4
9

``````

### Callbacks for white and black-sounding names (a summary)

``````

In :

# total résumés in the dataset
n = data.shape

# Callback / white/black-sounding name
total_call = data['id'].loc[data['call'] ==1.0].count()
call_w = data['id'].loc[(data['race'] =='w') & (data['call'] ==1.0)].count()
call_b = data['id'].loc[(data['race'] =='b') & (data['call'] ==1.0)].count()

# Summary
print("Total résumés = %d Curricula Vitae (CV)" % n)
print("Total callbacks = %d calls (%.2f%% all CV)" % (total_call,(100*(total_call/n))))
print("...")
print("...Callback for with white-sounding name = %d or %.2f%% from résumés with callbacks;" % (call_w, (100*(call_w/total_call))))
print("...Callback for black-sounding name = %d or %.2f%% from résumés with callbacks." % (call_b, (100*(call_b/total_call))))
print("...")
print("...Callback for white-sounding name is %.2f%% greater than for black-sounding names" % (100*((call_w - call_b)/call_w)))

``````
``````

Total résumés = 4870 Curricula Vitae (CV)
Total callbacks = 392 calls (8.05% all CV)
...
...Callback for with white-sounding name = 235 or 59.95% from résumés with callbacks;
...Callback for black-sounding name = 157 or 40.05% from résumés with callbacks.
...
...Callback for white-sounding name is 33.19% greater than for black-sounding names

``````

### Callbacks: a visual presentation

``````

In :

# create data
label_call_w = "white-sounding names - " + '{:.5}'.format(str(100*(call_w/total_call))) +"%"
label_call_b = "black-sounding names - " + '%.5s' % format(str(100*(call_b/total_call))) +"%"
names= label_call_w, label_call_b
size=[call_w, call_b]

# Create a circle for the center of the plot
inner_circle=plt.Circle( (0,0), 0.7, color='white')

# Give color names
plt.pie(size, labels=names, colors=['blue','skyblue'], labeldistance=1.0, wedgeprops = { 'linewidth' : 7, 'edgecolor' : 'white' })
p=plt.gcf()
plt.show()

``````
``````

``````

### 1.What test is appropriate for this problem? Does CLT apply?

• We have a problem with variables that represents categorical values: 'b', 'w'.
• For this type of analysis - relationship of two categorical variables -, their distribution in the dataset is often displayed in an R×C table, also referred to as a contingency table .
• In order to do hypothesis testing with categorical variables we use the chi-square test .
• Table 1. White-sounding and black-souding names versus callbacks
Callback No callback
white-sounding names X% Y%
black-sounding names Z% W%

Yes.

### 2.What are the null and alternate hypotheses?

2.1 - Null hypothesis: H0 => "Race has NOT a significant impact on the rate of callbacks for résumés".

2.2 - Alternate hypothesis: Ha => "Race has a significant impact on the rate of callbacks for résumés".

### 3. Compute margin of error, confidence interval, and p-value

``````

In :

# Dataframe with only the résumés with callback (data_calls)

``````
``````

Out:

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

id
race
call
education
yearsexp

85
b
w
1.0
2
7

95
b
w
1.0
2
4

105
b
w
1.0
4
6

107
b
b
1.0
4
6

126
b
b
1.0
4
9

``````

#### Sumary:

``````

In :

# Total résumés
n = data.shape

# Total callbacks
n_call = data_calls.shape

# Résumés white and black sounding names
total_white = data['id'].loc[(data['race'] =='w')].count()
total_black = data['id'].loc[(data['race'] =='b')].count()

# Perc.(%) white and black sounding names from total
perc_tot_white = total_white/n
perc_tot_black = total_black/n

# Perc.(%) white and black sounding names with callback
perc_call_white = call_w/n_call
perc_call_black = call_b/n_call

print("Total résumés with callback = %d" %n_call)
print("---------------------------------")
print("Perc.(%%) white-sounding names from total = %.2f%%" %(perc_tot_white*100))
print("Perc.(%%) black-sounding names from total = %.2f%%" %(perc_tot_black*100))
print("---------------------------------")
print("Perc.(%%) white-sounding names from callbacks = %.2f%%" %(perc_call_white*100))
print("Perc.(%%) black-sounding names from total = %.2f%%" %(perc_call_black*100))
print("---------------------------------")

``````
``````

Total résumés with callback = 392
---------------------------------
Perc.(%) white-sounding names from total = 50.00%
Perc.(%) black-sounding names from total = 50.00%
---------------------------------
Perc.(%) white-sounding names from callbacks = 59.95%
Perc.(%) black-sounding names from total = 40.05%
---------------------------------

``````

#### Barplot: 'race' versus 'callback'

``````

In :

rc = data[['race','call']]
sns.barplot(x='race', y='call', data=rc)
plt.show()

``````
``````

``````

#### Investigation: is there any relation between white/black-sounding names and education versus years of experience?

``````

In :

sns.set()

# Plot tip as a function of toal bill across days
g = sns.lmplot(x="yearsexp", y="education", hue="race",
truncate=True, size=7, data=data_calls)

g.set_axis_labels("Years of experience", "Years of formal education")

``````
``````

Out:

<seaborn.axisgrid.FacetGrid at 0x10d652048>

``````
``````

In :

# Arrays with callback:
#... w_callback = white-sounding name
#... b_callback = black-sounding name
w_callback = data_calls.iloc[:, 1][data_calls.race == 'w'].values
b_callback = data_calls.iloc[:, 1][data_calls.race == 'b'].values

``````

#### Margin of error and confidence intervals for 'w' and 'b' groups

• For alfa=0,05 and degrees of freedom (df)=1, we have x2=3.84 or x=1.96
• White-sounding names
``````

In :

critical_value = 1.96
margin_error = np.sqrt((perc_call_white*(1-perc_call_white)/n))*critical_value
low_critical_value = perc_call_white - margin_error
high_critical_value = perc_call_white + margin_error
print("White-sounding names:")
print("---------------------")
print("Mean: ", perc_call_white)
print("Margin of error: ", margin_error)
print("---------------------")
print("Confidence interval: ")
print("...From = %.2f" %low_critical_value)
print("...To   = %.2f" %high_critical_value)

``````
``````

White-sounding names:
---------------------
Mean:  0.599489795918
Margin of error:  0.0137622448744
---------------------
Confidence interval:
...From = 0.59
...To   = 0.61

``````
• Black-sounding names
``````

In :

critical_value = 1.96
margin_error = np.sqrt((perc_call_black*(1-perc_call_black)/n))*critical_value
low_critical_value = perc_call_black - margin_error
high_critical_value = perc_call_black + margin_error
print("Black-sounding names:")
print("---------------------")
print("...Mean: ", perc_call_black)
print("...Margin of error: ", margin_error)
print("---------------------")
print("Confidence interval: ")
print("...From = %.2f" %low_critical_value)
print("...To   = %.2f" %high_critical_value)

``````
``````

Black-sounding names:
---------------------
...Mean:  0.400510204082
...Margin of error:  0.0137622448744
---------------------
Confidence interval:
...From = 0.39
...To   = 0.41

``````

### 4. Write a story describing the statistical significance in the context or the original problem.

Based on data, with 95% confidence, we know that 40% of the candidates that received a call has a black-sounding name. Considering a margin of error of XXX, we can not afirm that was a racial bias (pro white-sounding names).

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

``````

In [ ]:

``````

## References:

``````

In [ ]:

``````