In 1973, the University of California-Berkeley (UC-Berkeley) was sued for sex discrimination. Its admission data showed that men applying to graduate school at UC-Berkeley were more likely to be admitted than women.
The graduate schools had just accepted 44% of male applicants but only 35% of female applicants. The difference was so great that it was unlikely to be due to chance.
By looking at the data more closely, you may realize that there is more to the story than meets the eye.
Download the 1973 UC-Berkeley Graduate School Admission Data and take a look yourself. This dataset contains information about the six most popular departments.
Do you agree that UC-Berkeley discriminated against women during the admission process?
Let us first access the dataset.
In [2]:
import matplotlib
%matplotlib inline
%automagic
In [10]:
from io import StringIO
import pandas
from pandas import DataFrame
import requests
############Wrangling Raw Data######################
data_site = r'http://www.calvin.edu/~stob/data/Berkeley.csv'
#get the csv file as an IO stream
admit_io = StringIO(requests.get(data_site).text)
#strip away nonsense from each line and get each line as a list of cells
admit_io = [line.strip('\n').strip('\r').split(',') for line in admit_io]
#get the field names from the first list
fieldnames = admit_io[0]
df=DataFrame(data=admit_io[1:], columns=fieldnames)
#Freq is read as text; convert it to float so it can be summed and divided
df['Freq'] = df['Freq'].astype(float)
print(df)
The dataset gives, for each of the six departments and for each gender, the number of applicants by admission status, either 'Admitted' or 'Rejected'.
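The layout described above can be pictured as a long-format table, with one row per combination of admission status, gender, and department. The sketch below parses a small inline sample with Python's `csv` module instead of splitting on commas by hand. The column order and the department label `A` are assumptions made for illustration, and the four counts are the ones commonly published for one department of this dataset, so check them against the downloaded file.

```python
import csv
from io import StringIO

# Assumed layout: one row per (Admit, Gender, Dept) with a count in Freq.
# The four rows are counts commonly published for a single department.
sample_csv = (
    "Admit,Gender,Dept,Freq\n"
    "Admitted,Male,A,512\n"
    "Rejected,Male,A,313\n"
    "Admitted,Female,A,89\n"
    "Rejected,Female,A,19\n"
)

# DictReader uses the header row as keys and copes with quoted fields.
rows = list(csv.DictReader(StringIO(sample_csv)))
for row in rows:
    row["Freq"] = float(row["Freq"])  # counts arrive as text

# Total applicants per gender in this sample.
applicants = {}
for row in rows:
    applicants[row["Gender"]] = applicants.get(row["Gender"], 0.0) + row["Freq"]

print(applicants)
```

Each `Freq` value is a count of people, so sums over rows give applicant totals directly.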
Let us compare the overall admission ratio of male applicants across all departments with the overall admission ratio of female applicants. The table above can be read as a frequency table: each row's Freq is a count of applicants.
In [12]:
#boolean masks for gender and admission status
male = df['Gender'] == 'Male'
female = df['Gender'] == 'Female'
admitted = df['Admit'] == 'Admitted'
#total ratio of male admissions across all departments
print("total ratio of male admissions across all departments: {}".format(
    df[male & admitted].Freq.sum() / df[male].Freq.sum()))
#total ratio of female admissions across all departments
print("total ratio of female admissions across all departments: {}".format(
    df[female & admitted].Freq.sum() / df[female].Freq.sum()))
Indeed, across all six departments about 44.5% of male applicants were admitted, against roughly 30% of female applicants. But applicants did not apply to all departments at once, so an admission ratio pooled across departments may not be the right measure. The admission ratio per department and per gender might be a better one.
Since each prospective student applied to a specific department, we cannot say anything about gender bias until we look at the figures department by department.
In short, we want the admission ratio of male applicants and of female applicants within each department, so that the two can be compared department by department.
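To make the argument above concrete, here is a minimal sketch that compares the pooled admission rate with the per-department rates. The counts are hard-coded so the example runs offline. They are the figures commonly published for this dataset (for instance in R's `UCBAdmissions` table), and both the department labels `A` to `F` and the dictionary layout are assumptions, so verify them against the downloaded CSV.

```python
# (admitted, rejected) per department; assumed to match the downloaded file.
counts = {
    "Male":   {"A": (512, 313), "B": (353, 207), "C": (120, 205),
               "D": (138, 279), "E": (53, 138),  "F": (22, 351)},
    "Female": {"A": (89, 19),   "B": (17, 8),    "C": (202, 391),
               "D": (131, 244), "E": (94, 299),  "F": (24, 317)},
}

def pooled_rate(by_dept):
    """Admitted applicants divided by all applicants, departments lumped together."""
    admitted = sum(a for a, r in by_dept.values())
    total = sum(a + r for a, r in by_dept.values())
    return admitted / total

def dept_rates(by_dept):
    """Admission rate within each department."""
    return {dept: a / (a + r) for dept, (a, r) in by_dept.items()}

male_rates = dept_rates(counts["Male"])
female_rates = dept_rates(counts["Female"])

print("pooled: male %.3f, female %.3f"
      % (pooled_rate(counts["Male"]), pooled_rate(counts["Female"])))
for dept in sorted(male_rates):
    print("dept %s: male %.3f, female %.3f"
          % (dept, male_rates[dept], female_rates[dept]))

# Departments where women were admitted at a higher rate than men.
female_ahead = [d for d in sorted(male_rates) if female_rates[d] > male_rates[d]]
print("female rate higher in:", female_ahead)
```

With these counts the pooled rates come out near 44.5% and 30.4%, yet the female rate is the higher one in four of the six departments, which is why the pooled comparison alone is not enough.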
To do that, let's first create one table each for the male and female applicants in the 1973 data.
In [14]:
df_m = df[df['Gender'] == 'Male']
df_f = df[df['Gender'] == 'Female']
#Get the admission ratios: each row's Freq divided by its department's total
#transform('sum') repeats the department total on every row of that department
res_m = df_m['Freq'] / df_m.groupby('Dept')['Freq'].transform('sum')
#join the Admission Ratio col to the admitted rows of the main DF
df_m = df_m[df_m.Admit=='Admitted'].join\
(pandas.Series(res_m, name='Male Admission Ratio', dtype=float))
#same steps for the female applicants
res_f = df_f['Freq'] / df_f.groupby('Dept')['Freq'].transform('sum')
df_f = df_f[df_f.Admit=='Admitted'].join\
(pandas.Series(res_f, name='Female Admission Ratio', dtype=float))
#male admission ratios table
print("male admission ratios table:\n")
print(df_m)
print("")
#female admission ratios table
print("female admission ratios table:\n")
print(df_f)
In [19]:
#bar chart of admitted male applicants per department
df_m.plot(x='Dept', y='Freq', kind='bar')
In [24]:
##########Getting Descriptive Statistics#################
#collect the per-department admission ratios as plain lists
male_admits = (df_m['Male Admission Ratio']).tolist()
female_admits = (df_f['Female Admission Ratio']).tolist()
#take the sample size of the above admission metric (one ratio per department)
male_sample_size = len(df_m['Male Admission Ratio'])
female_sample_size = len(df_f['Female Admission Ratio'])
#take admission ratios mean
male_mean = df_m['Male Admission Ratio'].mean()
female_mean = df_f['Female Admission Ratio'].mean()
print("Mean of departmental admission ratios across all dept. for males:\n%s"
      % male_mean)
print("")
print("Mean of departmental admission ratios across all dept. for females:\n%s"
      % female_mean)
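Note that the mean taken above gives every department equal weight, however many people applied to it. The sketch below contrasts that unweighted mean with an applicant-weighted mean, which reproduces the pooled admission rate. The counts are hard-coded on the assumption that they match the downloaded file (they are the commonly published figures for the male applicants), and the variable names are illustrative only.

```python
# (admitted, rejected) for male applicants in six departments; assumed values.
male_counts = [(512, 313), (353, 207), (120, 205),
               (138, 279), (53, 138), (22, 351)]

rates = [a / (a + r) for a, r in male_counts]   # per-department admission rates
sizes = [a + r for a, r in male_counts]         # applicants per department

# Unweighted mean: each department counts once.
unweighted_mean = sum(rates) / len(rates)

# Weighted mean: each department counts in proportion to its applicants.
weighted_mean = sum(rate * n for rate, n in zip(rates, sizes)) / sum(sizes)

# Pooled rate: total admitted divided by total applicants.
pooled = sum(a for a, r in male_counts) / sum(sizes)

print("unweighted mean of rates: %.4f" % unweighted_mean)
print("weighted mean of rates:   %.4f" % weighted_mean)
print("pooled admission rate:    %.4f" % pooled)
```

The weighted mean and the pooled rate agree, while the unweighted mean is lower here because the departments with the most male applicants also had the highest admission rates.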
In [34]:
%cd C:\\Users\\Asus s\\Desktop\\AKULs Files\\principles_of_computing\\probability_combinatorics
%pwd()
#py_variance_std is a local helper module, expected in the directory entered above
from py_variance_std import se_pooled_t, df_independent_sample, critical_t,calc_t_independent_sample, t_percentile
In [46]:
#Take pooled standard error of the independent male and female admission ratios
pooled_se_admit = se_pooled_t(male_admits, female_admits)
#Take Independent DF
independent_pop_sample_df = \
df_independent_sample(male_sample_size, female_sample_size)
#Two sided Critical T Value
critical_t_val = critical_t(95, independent_pop_sample_df, 0)
#T score
t_score = calc_t_independent_sample(male_mean, female_mean, pooled_se_admit)
t_per = t_percentile(t_score, independent_pop_sample_df)
print("Is T Percentile significant?: %s < 0.05 => %s" %(t_per, t_per < 0.05))
#two-sided test: compare the magnitude of the T score with the critical value
print("Is T score significant?: |%s| > %s => %s" %(t_score,critical_t_val,abs(t_score) > critical_t_val))
marginal_err = critical_t_val * pooled_se_admit
print("95%% C.I. around the Female Admissions Ratio mean of %s: between %s and %s"
      %(female_mean, female_mean-marginal_err, female_mean+marginal_err))
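The helper module `py_variance_std` is not shown here, so the following is only a sketch of what a pooled two-sample t-test typically computes, not the author's implementation. It assumes equal variances in both groups, uses per-department ratios derived from the commonly published counts (an assumption to check against the CSV), and hard-codes the two-sided 95% critical value for 10 degrees of freedom, 2.228, from a standard t-table. All function names are hypothetical.

```python
import math

# (admitted, rejected) per department; assumed to match the downloaded file.
male_counts = [(512, 313), (353, 207), (120, 205),
               (138, 279), (53, 138), (22, 351)]
female_counts = [(89, 19), (17, 8), (202, 391),
                 (131, 244), (94, 299), (24, 317)]

male_ratios = [a / (a + r) for a, r in male_counts]
female_ratios = [a / (a + r) for a, r in female_counts]

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_standard_error(xs, ys):
    """Standard error of mean(xs) - mean(ys), assuming equal variances."""
    n1, n2 = len(xs), len(ys)
    pooled_var = ((n1 - 1) * sample_variance(xs)
                  + (n2 - 1) * sample_variance(ys)) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

dof = len(male_ratios) + len(female_ratios) - 2   # degrees of freedom
se = pooled_standard_error(male_ratios, female_ratios)
diff = mean(male_ratios) - mean(female_ratios)
t_stat = diff / se

T_CRIT_95_DF10 = 2.228  # two-sided 95% critical value for 10 d.o.f. (t-table)
margin = T_CRIT_95_DF10 * se

print("t = %.3f, significant at 5%%: %s"
      % (t_stat, abs(t_stat) > T_CRIT_95_DF10))
print("95%% C.I. for the difference in means: %.3f to %.3f"
      % (diff - margin, diff + margin))
```

Under these assumptions the difference between the two means of departmental ratios is far from significant, and the interval for the difference straddles zero.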