Following the NY Times article "Gaps in Earnings Stand Out in Release of College Data", I decided to do some exploring myself.
Download data
In [ ]:
%%bash
cd ~/Downloads
wget https://s3.amazonaws.com/ed-college-choice-public/CollegeScorecard_Raw_Data.zip
unzip CollegeScorecard_Raw_Data.zip
let's explore
In [206]:
!ls ~/Downloads/CollegeScorecard_Raw_Data
for some reason, 2011 is the last year for which there is earnings information
In [207]:
import pandas as pd
df = pd.read_csv('~/Downloads/CollegeScorecard_Raw_Data/MERGED2011_PP.csv', na_values=['PrivacySuppressed'])
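the `na_values` argument matters here: suppressed cells hold the string `PrivacySuppressed`, and without it an otherwise numeric column would be read as text. A minimal sketch on a made-up three-row CSV (the school names and numbers are invented for illustration; only the column name and the `PrivacySuppressed` marker come from the real file):

```python
import io
import pandas as pd

# Toy CSV: the rows are invented, not taken from the Scorecard data.
toy_csv = io.StringIO(
    "INSTNM,mn_earn_wne_male1_p10\n"
    "School A,50000\n"
    "School B,PrivacySuppressed\n"
    "School C,42000\n"
)

# Without na_values the marker string keeps the column non-numeric.
raw = pd.read_csv(toy_csv)

# Rewind the buffer and read again, treating the marker as missing.
toy_csv.seek(0)
clean = pd.read_csv(toy_csv, na_values=['PrivacySuppressed'])

raw_is_numeric = pd.api.types.is_numeric_dtype(raw['mn_earn_wne_male1_p10'])
clean_is_numeric = pd.api.types.is_numeric_dtype(clean['mn_earn_wne_male1_p10'])
n_missing = int(clean['mn_earn_wne_male1_p10'].isna().sum())
print(raw_is_numeric, clean_is_numeric, n_missing)
```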
the number of schools covered
In [208]:
len(df)
Out[208]:
for each school there are lots of columns to read
In [209]:
len(df.columns)
Out[209]:
but there is a data dictionary that explains what each column is
In [210]:
ddict = pd.read_csv('~/Downloads/CollegeScorecard_Raw_Data/CollegeScorecardDataDictionary-09-12-2015.csv')
the columns are grouped
In [211]:
ddict['dev-category'].unique()
Out[211]:
there are many columns in the earnings category
In [212]:
pd.options.display.max_colwidth = 87
ddict[ddict['dev-category'] == 'earnings'].set_index('VARIABLE NAME')['NAME OF DATA ELEMENT']
Out[212]:
and here is the field for mean male earnings 10 years after entering school. yes, females are coded as male0
In [213]:
ddict[ddict['VARIABLE NAME'] == 'mn_earn_wne_male1_p10'].values
Out[213]:
compute the gap as mean male earnings minus mean female earnings, so a positive value means men earn more
In [214]:
df['diffp10'] = df.mn_earn_wne_male1_p10 - df.mn_earn_wne_male0_p10
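as a sanity check on the sign convention, here is a small sketch on invented numbers (the three schools and their earnings are made up). `male1` is men and `male0` is women, so a positive `diffp10` means men out-earn women, and a suppressed value on either side gives NaN:

```python
import numpy as np
import pandas as pd

# Invented rows; School C has a suppressed (missing) male figure.
toy = pd.DataFrame({
    'INSTNM': ['School A', 'School B', 'School C'],
    'mn_earn_wne_male1_p10': [60000.0, 40000.0, np.nan],
    'mn_earn_wne_male0_p10': [45000.0, 42000.0, 38000.0],
})

# Same subtraction as in the cell above: men minus women.
toy['diffp10'] = toy.mn_earn_wne_male1_p10 - toy.mn_earn_wne_male0_p10

gaps = toy.set_index('INSTNM')['diffp10']
print(gaps)
```

rows with NaN gaps can be dropped with `dropna(subset=['diffp10'])` if they get in the way.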
In [215]:
# largest male-minus-female gap first
df = df.sort_values(by='diffp10', ascending=False)
In [216]:
for name in ['Massachusetts Institute of Technology', 'Stanford University']:
    # first (and normally only) row whose INSTNM matches exactly
    print('{} {}'.format(name, df[df.INSTNM == name].diffp10.values[0]))
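the `.values[0]` lookup raises an `IndexError` when a name does not match `INSTNM` exactly. A possible more forgiving version, sketched on invented data (`gap_for` is a hypothetical helper, not something the notebook defines):

```python
import pandas as pd

def gap_for(frame, name):
    """Return diffp10 for the school called `name`, or None if it is absent."""
    matches = frame.loc[frame['INSTNM'] == name, 'diffp10']
    if matches.empty:
        return None
    return matches.iloc[0]

# Invented rows standing in for the real DataFrame.
toy = pd.DataFrame({
    'INSTNM': ['School A', 'School B'],
    'diffp10': [15000.0, -2000.0],
})

found = gap_for(toy, 'School A')
missing = gap_for(toy, 'School Z')
print(found, missing)
```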
In [220]:
df[['INSTNM','diffp10','mn_earn_wne_male0_p10','mn_earn_wne_male1_p10']].set_index('INSTNM').head(30)
Out[220]: