Analyzing the case of discrimination against women during the admission process at UC-Berkeley in 1973

Background

In 1973, the University of California-Berkeley (UC-Berkeley) was sued for sex discrimination. Its admission data showed that men applying to graduate school at UC-Berkeley were more likely to be admitted than women.

The graduate schools had just accepted 44% of male applicants but only 35% of female applicants. The difference was so great that it was unlikely to be due to chance.

By looking at the data more closely, you may realize that there is more to the story than meets the eye.

Data

Download the 1973 UC-Berkeley Graduate School Admission Data and take a look yourself. This dataset contains information about the six most popular departments.

Question

Do you agree that UC-Berkeley discriminated against women during the admission process?

Solution

Let us first access the dataset.



In [2]:

    
import matplotlib
%matplotlib inline
%automagic









    



Automagic is ON, % prefix IS NOT needed for line magics.



In [10]:

    
from cStringIO import StringIO
import pandas
from pandas import DataFrame
import requests

############Wrangling Raw Data######################
data_site = r'http://www.calvin.edu/~stob/data/Berkeley.csv'
#get the csv file as an IO stream
admit_io = StringIO(requests.get(data_site).text)

#strip away nonsense from each line and get each line as a list of cells
admit_io = [line.strip('\n').strip('\r').split(',') for line in admit_io]

#get the field names from the first list
fieldnames = admit_io[0]

df=DataFrame(data=admit_io[1:], columns=fieldnames)
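#convert the counts from strings to numbers (the dtype is given by name, so no numpy/pylab import is needed)
df['Freq'] = df['Freq'].astype('float64')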
print df









    



       Admit  Gender Dept  Freq
0   Admitted    Male    A   512
1   Rejected    Male    A   313
2   Admitted  Female    A    89
3   Rejected  Female    A    19
4   Admitted    Male    B   353
5   Rejected    Male    B   207
6   Admitted  Female    B    17
7   Rejected  Female    B     8
8   Admitted    Male    C   120
9   Rejected    Male    C   205
10  Admitted  Female    C   202
11  Rejected  Female    C   391
12  Admitted    Male    D   138
13  Rejected    Male    D   279
14  Admitted  Female    D   131
15  Rejected  Female    D   244
16  Admitted    Male    E    53
17  Rejected    Male    E   138
18  Admitted  Female    E    94
19  Rejected  Female    E   299
20  Admitted    Male    F    22
21  Rejected    Male    F   351
22  Admitted  Female    F    24
23  Rejected  Female    F   317

The dataset covers six departments. For each department and each gender, it records how many applicants were 'Admitted' and how many were 'Rejected'.
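As a side note, the manual line splitting in the loading cell can be sketched more compactly with pandas.read_csv. The snippet below is only an illustration written for current Python 3 and pandas, not the notebook's original Python 2 environment. It reads an excerpt of the table (departments A and B) pasted as inline text, so the name csv_text and the excerpt itself are assumptions made for the example.

```python
import io

import pandas as pd

# Excerpt of the table printed above (departments A and B only),
# pasted as CSV text so the example does not depend on the download URL.
csv_text = """Admit,Gender,Dept,Freq
Admitted,Male,A,512
Rejected,Male,A,313
Admitted,Female,A,89
Rejected,Female,A,19
Admitted,Male,B,353
Rejected,Male,B,207
Admitted,Female,B,17
Rejected,Female,B,8
"""

# read_csv splits the fields, takes the column names from the first line,
# and infers a numeric dtype for Freq, so no manual conversion is needed.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)
print(sample)
```

The same call accepts a file path or URL in place of the StringIO object, which would replace the requests and split steps used in the cell above.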

Let us compare the overall admission ratio of men across all departments with the overall admission ratio of women across all departments. The table above can be read as a frequency table: each row gives the number of applicants for one combination of admission status, gender, and department.



In [12]:

    
#boolean masks, combined with & instead of chaining two [] lookups
admitted = df['Admit'] == 'Admitted'
is_male = df['Gender'] == 'Male'
is_female = df['Gender'] == 'Female'

#total ratio of male admissions across all departments
print "total ratio of male admissions across all departments: {}".\
format(df[is_male & admitted].Freq.sum()/df[is_male].Freq.sum())

#total ratio of female admissions across all departments
print "total ratio of female admissions across all departments: {}".\
format(df[is_female & admitted].Freq.sum()/df[is_female].Freq.sum())









    



total ratio of male admissions across all departments: 0.445187662579
total ratio of female admissions across all departments: 0.303542234332

Indeed, in this six-department dataset about 44.5% of male applicants were admitted, compared with about 30.4% of female applicants. But applicants did not apply to all departments, so an admission ratio taken across all departments may not be the right measure. The admission ratio per department and per gender might be a better one.

Since each prospective student applied to a specific department, we cannot say anything about gender bias until we look at what the figures say department by department.

In short, we want to compare, for each department, the admission ratio of male applicants with the admission ratio of female applicants.
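To illustrate why the overall ratio can mislead, note that each gender's overall admission ratio is the average of its departmental ratios weighted by the share of its applications that went to each department. The sketch below, written for Python 3, works that out from the counts in the table above. The helper name summarize and the dictionary layout are assumptions made for this example.

```python
# Counts copied from the table above: (admitted, rejected) per department.
male = {"A": (512, 313), "B": (353, 207), "C": (120, 205),
        "D": (138, 279), "E": (53, 138), "F": (22, 351)}
female = {"A": (89, 19), "B": (17, 8), "C": (202, 391),
          "D": (131, 244), "E": (94, 299), "F": (24, 317)}


def summarize(counts):
    """Return application shares, admission rates, and the overall rate."""
    total = sum(a + r for a, r in counts.values())
    # Share of this gender's applications that went to each department.
    shares = {d: (a + r) / total for d, (a, r) in counts.items()}
    # Admission rate within each department.
    rates = {d: a / (a + r) for d, (a, r) in counts.items()}
    # The overall rate is the share-weighted average of the department rates.
    overall = sum(shares[d] * rates[d] for d in counts)
    return shares, rates, overall


for label, counts in (("Male", male), ("Female", female)):
    shares, rates, overall = summarize(counts)
    print("%s overall admission rate: %.3f" % (label, overall))
    for d in sorted(counts):
        print("  Dept %s: share of applications %.3f, admission rate %.3f"
              % (d, shares[d], rates[d]))
```

In these counts, roughly half of the male applications went to departments A and B, which admit most applicants, while only about 7% of the female applications did. The weights therefore differ sharply between the two groups even where the departmental rates are similar.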

For that, let's first create one table for the male applicants and one for the female applicants admitted in 1973.

Descriptive Analysis



In [14]:

    
df_m = df[df['Gender'] == 'Male']
df_f = df[df['Gender'] == 'Female']

#Get the admission ratios:
#admitted count divided by the total applications in the same department
res_m = df_m['Freq'] / df_m.groupby('Dept')['Freq'].transform('sum')
res_m.name = 'Male Admission Ratio'

#join the Admission Ratio col to the admitted rows of the main DF
df_m = df_m[df_m.Admit=='Admitted'].join(res_m)

#Get the admission ratios:
#admitted count divided by the total applications in the same department
res_f = df_f['Freq'] / df_f.groupby('Dept')['Freq'].transform('sum')
res_f.name = 'Female Admission Ratio'

#join the Admission Ratio col to the admitted rows of the main DF
df_f = df_f[df_f.Admit=='Admitted'].join(res_f)

#male admission ratios table
print "male admission ratios table:\n"
print df_m

print 

#female admission ratios table
print "female admission ratios table:\n"
print df_f









    



male admission ratios table:

       Admit Gender Dept  Freq  Male Admission Ratio
0   Admitted   Male    A   512              0.620606
4   Admitted   Male    B   353              0.630357
8   Admitted   Male    C   120              0.369231
12  Admitted   Male    D   138              0.330935
16  Admitted   Male    E    53              0.277487
20  Admitted   Male    F    22              0.058981

female admission ratios table:

       Admit  Gender Dept  Freq  Female Admission Ratio
2   Admitted  Female    A    89                0.824074
6   Admitted  Female    B    17                0.680000
10  Admitted  Female    C   202                0.340641
14  Admitted  Female    D   131                0.349333
18  Admitted  Female    E    94                0.239186
22  Admitted  Female    F    24                0.070381
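The two tables above can also be placed side by side so that the departmental ratios are easier to compare. The following is a minimal sketch in Python 3 with current pandas, not the notebook's original code. It rebuilds the counts from the printed table, and the names rows, counts, and rates are assumptions made for the example.

```python
import pandas as pd

# (Dept, Gender, Admitted, Rejected) copied from the table printed earlier.
rows = [
    ("A", "Male", 512, 313), ("A", "Female", 89, 19),
    ("B", "Male", 353, 207), ("B", "Female", 17, 8),
    ("C", "Male", 120, 205), ("C", "Female", 202, 391),
    ("D", "Male", 138, 279), ("D", "Female", 131, 244),
    ("E", "Male", 53, 138), ("E", "Female", 94, 299),
    ("F", "Male", 22, 351), ("F", "Female", 24, 317),
]
counts = pd.DataFrame(rows, columns=["Dept", "Gender", "Admitted", "Rejected"])

# Admission ratio within each (department, gender) cell.
counts["Rate"] = counts["Admitted"] / (counts["Admitted"] + counts["Rejected"])

# One row per department, one column per gender.
rates = counts.pivot(index="Dept", columns="Gender", values="Rate")
print(rates.round(3))

# Departments where the female ratio exceeds the male ratio.
print(list(rates.index[rates["Female"] > rates["Male"]]))
```

Read this way, the female admission ratio is the higher one in four of the six departments (A, B, D, and F), even though the overall female ratio is lower.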



In [19]:

    
#bar() needs the x positions as well as the bar heights
bar(range(len(df_m)), df_m['Freq'])
xticks(range(len(df_m)), df_m['Dept'])




In [24]:

    
##########Getting Descriptive Statistics#################

#collect the departmental admission ratios as plain lists
male_admits = (df_m['Male Admission Ratio']).tolist()
female_admits = (df_f['Female Admission Ratio']).tolist()

#number of departmental ratios in each group
male_sample_size = len(df_m['Male Admission Ratio'])
female_sample_size = len(df_f['Female Admission Ratio'])

#take admission ratios mean
male_mean = df_m['Male Admission Ratio'].mean()
female_mean = df_f['Female Admission Ratio'].mean()


print "Mean of departmental admission ratios across all dept. for males:\n%s"\
% male_mean
print
print "Mean of departmental admission ratios across all dept. for females:\n%s"\
% female_mean









    



Mean of departmental admission ratios across all dept. for males:
0.381266228122

Mean of departmental admission ratios across all dept. for females:
0.41726919986



In [34]:

    
%cd C:\\Users\\Asus s\\Desktop\\AKULs Files\\principles_of_computing\\probability_combinatorics
%pwd()
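#py_variance_std is a local helper module in the directory above; its source is not shown in this notebook
from py_variance_std import se_pooled_t, df_independent_sample, critical_t,calc_t_independent_sample, t_percentile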









    



C:\Users\Asus s\Desktop\AKULs Files\principles_of_computing\probability_combinatorics



In [46]:

    
#Take the pooled standard error of the independent male and female admission ratios
pooled_se_admit = se_pooled_t(male_admits, female_admits)

#Take Independent DF
independent_pop_sample_df = \
df_independent_sample(male_sample_size, female_sample_size)

#Two sided Critical T Value
critical_t_val = critical_t(95, independent_pop_sample_df, 0)

#T score
t_score = calc_t_independent_sample(male_mean, female_mean, pooled_se_admit)

t_per = t_percentile(t_score, independent_pop_sample_df)

print "Is T Percentile significant?: %s < 0.05 => %s" %(t_per, t_per < 0.05)

print "Is T score significant?: %s > %s => %s" %(t_score,critical_t_val,t_score > critical_t_val)

marginal_err = critical_t_val * pooled_se_admit

print r"For a 95 percentile C.I. of Female Admissions Ratio mean of %s is between %s and %s" \
%(female_mean, female_mean-marginal_err, female_mean+marginal_err)









    



Is T Percentile significant?: 0.815178360968 < 0.05 => False
Is T score significant?: 0.24 > 2.228 => False
For a 95 percentile C.I. of Female Admissions Ratio mean of 0.41726919986 is between 0.0830691998598 and 0.75146919986
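The helper functions imported from py_variance_std are not shown in this notebook, so the test above cannot be rerun as written. As a stand-in, the sketch below runs a pooled-variance two-sample t-test on the same departmental ratios with scipy.stats, in Python 3. Using SciPy here is an assumption, not the author's implementation, and small differences from the printed values are possible.

```python
from scipy import stats

# Departmental admission ratios copied from the two tables above.
male_admits = [0.620606, 0.630357, 0.369231, 0.330935, 0.277487, 0.058981]
female_admits = [0.824074, 0.680000, 0.340641, 0.349333, 0.239186, 0.070381]

# equal_var=True requests the classic pooled-variance (Student) t-test.
t_stat, p_value = stats.ttest_ind(male_admits, female_admits, equal_var=True)

# Two-sided critical value at the 95% level, with n1 + n2 - 2 degrees of freedom.
dof = len(male_admits) + len(female_admits) - 2
t_crit = stats.t.ppf(0.975, dof)

print("t = %.3f, p = %.3f, critical t = %.3f" % (t_stat, p_value, t_crit))
print("significant at the 5%% level: %s" % (abs(t_stat) > t_crit))
```

The sign of the t statistic only reflects which group is listed first, so it is the absolute value that is compared with the critical value. As in the output above, the difference between the mean departmental ratios is not significant at the 5% level.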


