Following the NY Times article "Gaps in Earnings Stand Out in Release of College Data", I decided to do some exploring of my own.

Download the data:


In [ ]:
%%bash
cd ~/Downloads
wget https://s3.amazonaws.com/ed-college-choice-public/CollegeScorecard_Raw_Data.zip
unzip CollegeScorecard_Raw_Data.zip

Let's explore:


In [206]:
!ls ~/Downloads/CollegeScorecard_Raw_Data


CollegeScorecardDataDictionary-09-12-2015.csv
CollegeScorecardDataDictionary-09-12-2015.pdf
Crosswalk_ZIP
Crosswalk_ZIP.zip
Data_File_Cohort_Map
FullDataDocumentation.pdf
MERGED1996_PP.csv
MERGED1997_PP.csv
MERGED1998_PP.csv
MERGED1999_PP.csv
MERGED2000_PP.csv
MERGED2001_PP.csv
MERGED2002_PP.csv
MERGED2003_PP.csv
MERGED2004_PP.csv
MERGED2005_PP.csv
MERGED2006_PP.csv
MERGED2007_PP.csv
MERGED2008_PP.csv
MERGED2009_PP.csv
MERGED2010_PP.csv
MERGED2011_PP.csv
MERGED2012_PP.csv
MERGED2013_PP.csv
data_dictionary.yaml

For some reason, 2011 is the last year for which there is earnings information.
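One way to check which yearly files actually carry earnings data is to read a single earnings column from each and see whether anything non-missing is in it. The sketch below is only an illustration: `has_earnings` is a hypothetical helper, it assumes the column `mn_earn_wne_p10` is present in every file, and the two in-memory CSVs are made-up stand-ins for real yearly files.

```python
import io
import pandas as pd

def has_earnings(source, column='mn_earn_wne_p10'):
    """Hypothetical helper: True if `column` has at least one non-missing value."""
    # Read only the one column; treat 'PrivacySuppressed' as missing.
    col = pd.read_csv(source, usecols=[column], na_values=['PrivacySuppressed'])[column]
    return bool(col.notna().any())

# Made-up stand-ins for two yearly files (not real Scorecard rows).
# 'NULL' is one of the strings pandas treats as missing by default.
with_data = io.StringIO("INSTNM,mn_earn_wne_p10\nSchool A,41000\nSchool B,PrivacySuppressed\n")
without_data = io.StringIO("INSTNM,mn_earn_wne_p10\nSchool A,NULL\nSchool B,NULL\n")

print(has_earnings(with_data))
print(has_earnings(without_data))
```

On the real data, the same function could be called with the path of each `MERGED*_PP.csv` file in turn.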


In [207]:
import pandas as pd
df = pd.read_csv('~/Downloads/CollegeScorecard_Raw_Data/MERGED2011_PP.csv', na_values=['PrivacySuppressed'])
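To see why `na_values=['PrivacySuppressed']` matters, here is a small sketch on made-up CSV text (the school names and numbers are invented for illustration). Without the option, the marker string keeps the column from being parsed as numbers; with it, the marker becomes NaN and the column is numeric.

```python
import io
import pandas as pd

# Made-up rows for illustration only; not taken from the Scorecard files.
csv_text = """INSTNM,mn_earn_wne_male1_p10
School A,50000
School B,PrivacySuppressed
"""

# Default parsing: the marker string forces a non-numeric column.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the marker is read as a missing value.
clean = pd.read_csv(io.StringIO(csv_text), na_values=['PrivacySuppressed'])

print(raw['mn_earn_wne_male1_p10'].dtype)
print(clean['mn_earn_wne_male1_p10'].dtype)
print(clean['mn_earn_wne_male1_p10'].isna().sum())
```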

The number of schools covered:


In [208]:
len(df)


Out[208]:
7675

Each school has a lot of columns to read:


In [209]:
len(df.columns)


Out[209]:
1729

Fortunately, there is a data dictionary that describes what each column is:


In [210]:
ddict = pd.read_csv('~/Downloads/CollegeScorecard_Raw_Data/CollegeScorecardDataDictionary-09-12-2015.csv')

The columns are grouped into categories:


In [211]:
ddict['dev-category'].unique()


Out[211]:
array(['root', 'school', nan, 'admissions', 'academics', 'student', 'cost',
       'aid', 'completion', 'repayment', 'earnings'], dtype=object)

There are many columns in the earnings category:


In [212]:
pd.options.display.max_colwidth = 87
ddict[ddict['dev-category'] == 'earnings'].set_index('VARIABLE NAME')['NAME OF DATA ELEMENT']


Out[212]:
VARIABLE NAME
count_ed                                                                     Count of students in the earnings cohort
count_nwne_p10                                   Number of students not working and not enrolled 10 years after entry
count_wne_p10                                        Number of students working and not enrolled 10 years after entry
mn_earn_wne_p10                               Mean earnings of students working and not enrolled 10 years after entry
md_earn_wne_p10                             Median earnings of students working and not enrolled 10 years after entry
pct10_earn_wne_p10              10th percentile of earnings of students working and not enrolled 10 years after entry
pct25_earn_wne_p10              25th percentile of earnings of students working and not enrolled 10 years after entry
pct75_earn_wne_p10              75th percentile of earnings of students working and not enrolled 10 years after entry
pct90_earn_wne_p10              90th percentile of earnings of students working and not enrolled 10 years after entry
sd_earn_wne_p10                Standard deviation of earnings of students working and not enrolled 10 years after ...
count_wne_inc1_p10             Number of students working and not enrolled 10 years after entry in the lowest inco...
count_wne_inc2_p10             Number of students working and not enrolled 10 years after entry in the middle inco...
count_wne_inc3_p10             Number of students working and not enrolled 10 years after entry in the highest inc...
count_wne_indep0_inc1_p10      Number of dependent students working and not enrolled 10 years after entry in the l...
count_wne_indep0_p10                       Number of dependent students working and not enrolled 10 years after entry
count_wne_indep1_p10                     Number of independent students working and not enrolled 10 years after entry
count_wne_male0_p10                           Number of female students working and not enrolled 10 years after entry
count_wne_male1_p10                             Number of male students working and not enrolled 10 years after entry
gt_25k_p10                      Share of students earning over $25,000/year (threshold earnings) 10 years after entry
mn_earn_wne_inc1_p10           Mean earnings of students working and not enrolled 10 years after entry in the lowe...
mn_earn_wne_inc2_p10           Mean earnings of students working and not enrolled 10 years after entry in the midd...
mn_earn_wne_inc3_p10           Mean earnings of students working and not enrolled 10 years after entry in the high...
mn_earn_wne_indep0_inc1_p10    Mean earnings of dependent students working and not enrolled 10 years after entry i...
mn_earn_wne_indep0_p10              Mean earnings of dependent students working and not enrolled 10 years after entry
mn_earn_wne_indep1_p10            Mean earnings of independent students working and not enrolled 10 years after entry
mn_earn_wne_male0_p10                  Mean earnings of female students working and not enrolled 10 years after entry
mn_earn_wne_male1_p10                    Mean earnings of male students working and not enrolled 10 years after entry
count_nwne_p6                                     Number of students not working and not enrolled 6 years after entry
count_wne_p6                                          Number of students working and not enrolled 6 years after entry
mn_earn_wne_p6                                 Mean earnings of students working and not enrolled 6 years after entry
                                                                        ...                                          
count_wne_male1_p6                               Number of male students working and not enrolled 6 years after entry
gt_25k_p6                        Share of students earning over $25,000/year (threshold earnings) 6 years after entry
mn_earn_wne_inc1_p6            Mean earnings of students working and not enrolled 6 years after entry in the lowes...
mn_earn_wne_inc2_p6            Mean earnings of students working and not enrolled 6 years after entry in the middl...
mn_earn_wne_inc3_p6            Mean earnings of students working and not enrolled 6 years after entry in the highe...
mn_earn_wne_indep0_inc1_p6     Mean earnings of dependent students working and not enrolled 6 years after entry in...
mn_earn_wne_indep0_p6                Mean earnings of dependent students working and not enrolled 6 years after entry
mn_earn_wne_indep1_p6              Mean earnings of independent students working and not enrolled 6 years after entry
mn_earn_wne_male0_p6                    Mean earnings of female students working and not enrolled 6 years after entry
mn_earn_wne_male1_p6                      Mean earnings of male students working and not enrolled 6 years after entry
count_nwne_p7                                     Number of students not working and not enrolled 7 years after entry
count_wne_p7                                          Number of students working and not enrolled 7 years after entry
mn_earn_wne_p7                                 Mean earnings of students working and not enrolled 7 years after entry
sd_earn_wne_p7                 Standard deviation of earnings of students working and not enrolled 7 years after e...
gt_25k_p7                        Share of students earning over $25,000/year (threshold earnings) 7 years after entry
count_nwne_p8                                     Number of students not working and not enrolled 8 years after entry
count_wne_p8                                          Number of students working and not enrolled 8 years after entry
mn_earn_wne_p8                                 Mean earnings of students working and not enrolled 8 years after entry
md_earn_wne_p8                               Median earnings of students working and not enrolled 8 years after entry
pct10_earn_wne_p8                10th percentile of earnings of students working and not enrolled 8 years after entry
pct25_earn_wne_p8                25th percentile of earnings of students working and not enrolled 8 years after entry
pct75_earn_wne_p8                75th percentile of earnings of students working and not enrolled 8 years after entry
pct90_earn_wne_p8                90th percentile of earnings of students working and not enrolled 8 years after entry
sd_earn_wne_p8                 Standard deviation of earnings of students working and not enrolled 8 years after e...
gt_25k_p8                        Share of students earning over $25,000/year (threshold earnings) 8 years after entry
count_nwne_p9                                     Number of students not working and not enrolled 9 years after entry
count_wne_p9                                          Number of students working and not enrolled 9 years after entry
mn_earn_wne_p9                                 Mean earnings of students working and not enrolled 9 years after entry
sd_earn_wne_p9                 Standard deviation of earnings of students working and not enrolled 9 years after e...
gt_25k_p9                        Share of students earning over $25,000/year (threshold earnings) 9 years after entry
Name: NAME OF DATA ELEMENT, dtype: object

Here is the field for mean male earnings 10 years after entering school. And yes, females are coded as male0.


In [213]:
ddict[ddict['VARIABLE NAME'] == 'mn_earn_wne_male1_p10'].values


Out[213]:
array([[ 'Mean earnings of male students working and not enrolled 10 years after entry',
        nan, 'earnings', '10_yrs_after_entry.mean_earnings.male_students',
        'mn_earn_wne_male1_p10', 'integer', nan, nan, nan, nan, 'Treasury',
        nan]], dtype=object)
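Since the `male0`/`male1` naming is easy to mix up, a small lookup by variable name can help. The sketch below rebuilds a two-row piece of the dictionary by hand, with descriptions copied from the listing above. `ddict_demo` and `describe` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Two rows copied from the dictionary listing; the real file has more rows and columns.
ddict_demo = pd.DataFrame({
    'VARIABLE NAME': ['mn_earn_wne_male0_p10', 'mn_earn_wne_male1_p10'],
    'NAME OF DATA ELEMENT': [
        'Mean earnings of female students working and not enrolled 10 years after entry',
        'Mean earnings of male students working and not enrolled 10 years after entry',
    ],
})

# Index by variable name so a description can be looked up directly.
descriptions = ddict_demo.set_index('VARIABLE NAME')['NAME OF DATA ELEMENT']

def describe(variable):
    """Hypothetical helper: return the description for one variable name."""
    return descriptions[variable]

print(describe('mn_earn_wne_male0_p10'))
print(describe('mn_earn_wne_male1_p10'))
```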

Compute the difference between men's and women's mean earnings (male minus female):


In [214]:
df['diffp10'] = df.mn_earn_wne_male1_p10 - df.mn_earn_wne_male0_p10

In [215]:
# Largest male-female gap first
df = df.sort_values(by='diffp10', ascending=False)
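Here is a toy version of the two steps above, with made-up schools and numbers, to show what happens to schools with missing earnings. It uses `sort_values`, which is how sorting is spelled in current pandas. The subtraction gives NaN wherever either side is missing, and those rows end up at the bottom after sorting.

```python
import numpy as np
import pandas as pd

# Made-up schools and earnings, for illustration only.
demo = pd.DataFrame({
    'INSTNM': ['School A', 'School B', 'School C'],
    'mn_earn_wne_male0_p10': [40000.0, np.nan, 50000.0],
    'mn_earn_wne_male1_p10': [45000.0, 60000.0, 70000.0],
})

# Male mean minus female mean; NaN where either value is missing.
demo['diffp10'] = demo.mn_earn_wne_male1_p10 - demo.mn_earn_wne_male0_p10

# Largest gap first; rows with a missing gap are placed last by default.
demo = demo.sort_values(by='diffp10', ascending=False)
print(demo[['INSTNM', 'diffp10']])
```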

In [216]:
for name in ['Massachusetts Institute of Technology', 'Stanford University']:
    print(name, df[df.INSTNM == name].diffp10.values[0])


Massachusetts Institute of Technology 58100.0
Stanford University 56400.0

The 30 schools with the largest gap:


In [220]:
df[['INSTNM','diffp10','mn_earn_wne_male0_p10','mn_earn_wne_male1_p10']].set_index('INSTNM').head(30)


Out[220]:
INSTNM                                                        diffp10  mn_earn_wne_male0_p10  mn_earn_wne_male1_p10
University of Medicine and Dentistry of New Jersey             113100                 120200                 233300
The University of Texas Health Science Center at Houston       110400                  83500                 193900
Upstate Medical University                                      96100                 108100                 204200
Rosalind Franklin University of Medicine and Science            94900                 137100                 232000
SUNY Downstate Medical Center                                   81200                 136800                 218000
University of Nebraska Medical Center                           79000                  72300                 151300
Medical University of South Carolina                            69100                  76100                 145200
Oregon Health & Science University                              68500                  70500                 139000
University of Texas Southwestern Medical Center                 67300                 107300                 174600
Georgia Health Sciences University                              63800                  64400                 128200
Philadelphia College of Osteopathic Medicine                    62400                 143100                 205500
Western University of Health Sciences                           59700                 108400                 168100
University of California-San Francisco                          58700                 119400                 178100
Midwestern University-Downers Grove                             58200                 111700                 169900
Midwestern University-Glendale                                  58200                 111700                 169900
Massachusetts Institute of Technology                           58100                  93700                 151800
Stanford University                                             56400                  94700                 151100
Brigham Young University-Provo                                  55000                  29500                  84500
Harvard University                                              54600                 111200                 165800
The University of Texas Health Science Center at San Antonio    54200                  80900                 135100
University of Maryland-Baltimore                                53800                  84200                 138000
Middlebury College                                              51900                  57300                 109200
Monterey Institute of International Studies                     51900                  57300                 109200
Amherst College                                                 49900                  62800                 112700
University of Pennsylvania                                      49000                  92000                 141000
Princeton University                                            47700                  89700                 137400
Thomas Jefferson University                                     41100                  80100                 121200
Loma Linda University                                           40000                  68500                 108500
University of Chicago                                           39700                  78500                 118200
Golden Gate University-San Francisco                            39300                  63300                 102600