Following the NY Times article "Gaps in Earnings Stand Out in Release of College Data", I decided to do some exploration myself.

Download data



In [ ]:

    
%%bash
cd ~/Downloads
wget https://s3.amazonaws.com/ed-college-choice-public/CollegeScorecard_Raw_Data.zip
unzip CollegeScorecard_Raw_Data.zip

Let's explore what the archive contains.



In [206]:

    
!ls ~/Downloads/CollegeScorecard_Raw_Data









    



CollegeScorecardDataDictionary-09-12-2015.csv
CollegeScorecardDataDictionary-09-12-2015.pdf
Crosswalk_ZIP
Crosswalk_ZIP.zip
Data_File_Cohort_Map
FullDataDocumentation.pdf
MERGED1996_PP.csv
MERGED1997_PP.csv
MERGED1998_PP.csv
MERGED1999_PP.csv
MERGED2000_PP.csv
MERGED2001_PP.csv
MERGED2002_PP.csv
MERGED2003_PP.csv
MERGED2004_PP.csv
MERGED2005_PP.csv
MERGED2006_PP.csv
MERGED2007_PP.csv
MERGED2008_PP.csv
MERGED2009_PP.csv
MERGED2010_PP.csv
MERGED2011_PP.csv
MERGED2012_PP.csv
MERGED2013_PP.csv
data_dictionary.yaml

For some reason, 2011 is the last year for which there is earnings information, so that is the file loaded here. Cells marked 'PrivacySuppressed' are read as missing values.



In [207]:

    
import pandas as pd
df = pd.read_csv('~/Downloads/CollegeScorecard_Raw_Data/MERGED2011_PP.csv', na_values=['PrivacySuppressed'])
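The effect of `na_values` can be seen on a tiny made-up file. The sketch below uses a hypothetical three-row excerpt (the school names and numbers are invented, not taken from the real data) and compares reading it with and without the option.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt; the real file has thousands of rows and columns.
sample = io.StringIO(
    "INSTNM,mn_earn_wne_male1_p10\n"
    "School A,45000\n"
    "School B,PrivacySuppressed\n"
    "School C,61000\n"
)
col = "mn_earn_wne_male1_p10"

# Without na_values the marker string forces the whole column to be text.
raw = pd.read_csv(sample)
print(raw[col].dtype)

# With na_values the marker becomes NaN and the column is numeric.
sample.seek(0)
clean = pd.read_csv(sample, na_values=["PrivacySuppressed"])
print(clean[col].dtype, clean[col].isna().sum())
```

Without the option, arithmetic such as subtracting two earnings columns would fail or give nonsense, because the values are strings.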

The number of schools covered:



In [208]:

    
len(df)









    Out[208]:





7675

For each school there are a lot of columns to read:



In [209]:

    
len(df.columns)









    Out[209]:





1729

Fortunately there is a data dictionary that describes what each column is.



In [210]:

    
ddict = pd.read_csv('~/Downloads/CollegeScorecard_Raw_Data/CollegeScorecardDataDictionary-09-12-2015.csv')

The columns are grouped into categories:



In [211]:

    
ddict['dev-category'].unique()









    Out[211]:





array(['root', 'school', nan, 'admissions', 'academics', 'student', 'cost',
       'aid', 'completion', 'repayment', 'earnings'], dtype=object)

There are many columns in the earnings category:



In [212]:

    
pd.options.display.max_colwidth = 87
ddict[ddict['dev-category'] == 'earnings'].set_index('VARIABLE NAME')['NAME OF DATA ELEMENT']









    Out[212]:





VARIABLE NAME
count_ed                                                                     Count of students in the earnings cohort
count_nwne_p10                                   Number of students not working and not enrolled 10 years after entry
count_wne_p10                                        Number of students working and not enrolled 10 years after entry
mn_earn_wne_p10                               Mean earnings of students working and not enrolled 10 years after entry
md_earn_wne_p10                             Median earnings of students working and not enrolled 10 years after entry
pct10_earn_wne_p10              10th percentile of earnings of students working and not enrolled 10 years after entry
pct25_earn_wne_p10              25th percentile of earnings of students working and not enrolled 10 years after entry
pct75_earn_wne_p10              75th percentile of earnings of students working and not enrolled 10 years after entry
pct90_earn_wne_p10              90th percentile of earnings of students working and not enrolled 10 years after entry
sd_earn_wne_p10                Standard deviation of earnings of students working and not enrolled 10 years after ...
count_wne_inc1_p10             Number of students working and not enrolled 10 years after entry in the lowest inco...
count_wne_inc2_p10             Number of students working and not enrolled 10 years after entry in the middle inco...
count_wne_inc3_p10             Number of students working and not enrolled 10 years after entry in the highest inc...
count_wne_indep0_inc1_p10      Number of dependent students working and not enrolled 10 years after entry in the l...
count_wne_indep0_p10                       Number of dependent students working and not enrolled 10 years after entry
count_wne_indep1_p10                     Number of independent students working and not enrolled 10 years after entry
count_wne_male0_p10                           Number of female students working and not enrolled 10 years after entry
count_wne_male1_p10                             Number of male students working and not enrolled 10 years after entry
gt_25k_p10                      Share of students earning over $25,000/year (threshold earnings) 10 years after entry
mn_earn_wne_inc1_p10           Mean earnings of students working and not enrolled 10 years after entry in the lowe...
mn_earn_wne_inc2_p10           Mean earnings of students working and not enrolled 10 years after entry in the midd...
mn_earn_wne_inc3_p10           Mean earnings of students working and not enrolled 10 years after entry in the high...
mn_earn_wne_indep0_inc1_p10    Mean earnings of dependent students working and not enrolled 10 years after entry i...
mn_earn_wne_indep0_p10              Mean earnings of dependent students working and not enrolled 10 years after entry
mn_earn_wne_indep1_p10            Mean earnings of independent students working and not enrolled 10 years after entry
mn_earn_wne_male0_p10                  Mean earnings of female students working and not enrolled 10 years after entry
mn_earn_wne_male1_p10                    Mean earnings of male students working and not enrolled 10 years after entry
count_nwne_p6                                     Number of students not working and not enrolled 6 years after entry
count_wne_p6                                          Number of students working and not enrolled 6 years after entry
mn_earn_wne_p6                                 Mean earnings of students working and not enrolled 6 years after entry
                                                                        ...                                          
count_wne_male1_p6                               Number of male students working and not enrolled 6 years after entry
gt_25k_p6                        Share of students earning over $25,000/year (threshold earnings) 6 years after entry
mn_earn_wne_inc1_p6            Mean earnings of students working and not enrolled 6 years after entry in the lowes...
mn_earn_wne_inc2_p6            Mean earnings of students working and not enrolled 6 years after entry in the middl...
mn_earn_wne_inc3_p6            Mean earnings of students working and not enrolled 6 years after entry in the highe...
mn_earn_wne_indep0_inc1_p6     Mean earnings of dependent students working and not enrolled 6 years after entry in...
mn_earn_wne_indep0_p6                Mean earnings of dependent students working and not enrolled 6 years after entry
mn_earn_wne_indep1_p6              Mean earnings of independent students working and not enrolled 6 years after entry
mn_earn_wne_male0_p6                    Mean earnings of female students working and not enrolled 6 years after entry
mn_earn_wne_male1_p6                      Mean earnings of male students working and not enrolled 6 years after entry
count_nwne_p7                                     Number of students not working and not enrolled 7 years after entry
count_wne_p7                                          Number of students working and not enrolled 7 years after entry
mn_earn_wne_p7                                 Mean earnings of students working and not enrolled 7 years after entry
sd_earn_wne_p7                 Standard deviation of earnings of students working and not enrolled 7 years after e...
gt_25k_p7                        Share of students earning over $25,000/year (threshold earnings) 7 years after entry
count_nwne_p8                                     Number of students not working and not enrolled 8 years after entry
count_wne_p8                                          Number of students working and not enrolled 8 years after entry
mn_earn_wne_p8                                 Mean earnings of students working and not enrolled 8 years after entry
md_earn_wne_p8                               Median earnings of students working and not enrolled 8 years after entry
pct10_earn_wne_p8                10th percentile of earnings of students working and not enrolled 8 years after entry
pct25_earn_wne_p8                25th percentile of earnings of students working and not enrolled 8 years after entry
pct75_earn_wne_p8                75th percentile of earnings of students working and not enrolled 8 years after entry
pct90_earn_wne_p8                90th percentile of earnings of students working and not enrolled 8 years after entry
sd_earn_wne_p8                 Standard deviation of earnings of students working and not enrolled 8 years after e...
gt_25k_p8                        Share of students earning over $25,000/year (threshold earnings) 8 years after entry
count_nwne_p9                                     Number of students not working and not enrolled 9 years after entry
count_wne_p9                                          Number of students working and not enrolled 9 years after entry
mn_earn_wne_p9                                 Mean earnings of students working and not enrolled 9 years after entry
sd_earn_wne_p9                 Standard deviation of earnings of students working and not enrolled 9 years after e...
gt_25k_p9                        Share of students earning over $25,000/year (threshold earnings) 9 years after entry
Name: NAME OF DATA ELEMENT, dtype: object
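With this many variables, searching the descriptions by keyword can be quicker than scanning the full listing. Here is a minimal sketch, assuming the dictionary keeps the column headers used above; `find_variables` is a hypothetical helper and `ddict_demo` is a four-row stand-in for the real dictionary.

```python
import pandas as pd

# Hypothetical excerpt of the data dictionary, using the same column headers.
ddict_demo = pd.DataFrame({
    "VARIABLE NAME": [
        "mn_earn_wne_male0_p10",
        "mn_earn_wne_male1_p10",
        "md_earn_wne_p10",
        "count_ed",
    ],
    "NAME OF DATA ELEMENT": [
        "Mean earnings of female students working and not enrolled 10 years after entry",
        "Mean earnings of male students working and not enrolled 10 years after entry",
        "Median earnings of students working and not enrolled 10 years after entry",
        "Count of students in the earnings cohort",
    ],
    "dev-category": ["earnings"] * 4,
})

def find_variables(ddict, keyword, category=None):
    """Return descriptions containing `keyword` (case-insensitive), indexed by variable name."""
    descriptions = ddict["NAME OF DATA ELEMENT"].fillna("")
    mask = descriptions.str.contains(keyword, case=False, regex=False)
    if category is not None:
        mask &= ddict["dev-category"] == category
    return ddict.loc[mask].set_index("VARIABLE NAME")["NAME OF DATA ELEMENT"]

print(find_variables(ddict_demo, "female", category="earnings"))
```

Note that a plain substring search for "male" also matches "female", so the more specific keyword is the safer one to use.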

Here is the field for mean male earnings 10 years after entry (not after finishing school). And yes, female students are coded as male0.



In [213]:

    
ddict[ddict['VARIABLE NAME'] == 'mn_earn_wne_male1_p10'].values









    Out[213]:





array([[ 'Mean earnings of male students working and not enrolled 10 years after entry',
        nan, 'earnings', '10_yrs_after_entry.mean_earnings.male_students',
        'mn_earn_wne_male1_p10', 'integer', nan, nan, nan, nan, 'Treasury',
        nan]], dtype=object)

Compute the gap in mean earnings, men minus women, and sort the schools by it:



In [214]:

    
df['diffp10'] = df.mn_earn_wne_male1_p10 - df.mn_earn_wne_male0_p10



In [215]:

    
# DataFrame.sort() was removed from pandas; sort_values() is its replacement
df = df.sort_values(by='diffp10', ascending=False)
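Because suppressed cells were read as NaN, many schools will have no gap value at all. The sketch below uses invented numbers to show how the subtraction and the sort treat those rows; the school names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN stands in for a privacy-suppressed cell.
demo = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C"],
    "mn_earn_wne_male1_p10": [60000.0, np.nan, 90000.0],
    "mn_earn_wne_male0_p10": [50000.0, 40000.0, 55000.0],
})

# NaN in either operand propagates to the difference.
demo["diffp10"] = demo.mn_earn_wne_male1_p10 - demo.mn_earn_wne_male0_p10

# sort_values puts NaN rows last by default, whatever the sort direction.
ranked = demo.sort_values(by="diffp10", ascending=False)
complete = ranked.dropna(subset=["diffp10"])

print(ranked[["INSTNM", "diffp10"]])
print(len(complete), "of", len(ranked), "schools have both values")
```

So the head of the sorted frame only ever shows schools where both the male and the female figure were published.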



In [216]:

    
# Look up the gap for two well-known schools by exact institution name
for name in ['Massachusetts Institute of Technology', 'Stanford University']:
    print(name, df[df.INSTNM == name].diffp10.values[0])









    



Massachusetts Institute of Technology 58100.0
Stanford University 56400.0
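The lookup above relies on the name matching `INSTNM` exactly; if it does not, `.values[0]` raises an `IndexError`. A slightly more forgiving version might look like the sketch below, where `gap_for` is a hypothetical helper and the demo frame holds invented values.

```python
import pandas as pd

def gap_for(frame, name, column="diffp10"):
    """Return `column` for the institution called `name`, or None if no row matches."""
    values = frame.loc[frame["INSTNM"] == name, column]
    return None if values.empty else values.iloc[0]

# Invented data for illustration only.
demo = pd.DataFrame({
    "INSTNM": ["School A", "School B"],
    "diffp10": [10000.0, 35000.0],
})

for name in ["School B", "School Z"]:
    print(name, gap_for(demo, name))
```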

Results: the 30 schools with the largest gap between male and female mean earnings 10 years after entry



In [220]:

    
df[['INSTNM','diffp10','mn_earn_wne_male0_p10','mn_earn_wne_male1_p10']].set_index('INSTNM').head(30)









    Out[220]:






  
    
      
	diffp10	mn_earn_wne_male0_p10	mn_earn_wne_male1_p10
INSTNM
University of Medicine and Dentistry of New Jersey	113100	120200	233300
The University of Texas Health Science Center at Houston	110400	83500	193900
Upstate Medical University	96100	108100	204200
Rosalind Franklin University of Medicine and Science	94900	137100	232000
SUNY Downstate Medical Center	81200	136800	218000
University of Nebraska Medical Center	79000	72300	151300
Medical University of South Carolina	69100	76100	145200
Oregon Health & Science University	68500	70500	139000
University of Texas Southwestern Medical Center	67300	107300	174600
Georgia Health Sciences University	63800	64400	128200
Philadelphia College of Osteopathic Medicine	62400	143100	205500
Western University of Health Sciences	59700	108400	168100
University of California-San Francisco	58700	119400	178100
Midwestern University-Downers Grove	58200	111700	169900
Midwestern University-Glendale	58200	111700	169900
Massachusetts Institute of Technology	58100	93700	151800
Stanford University	56400	94700	151100
Brigham Young University-Provo	55000	29500	84500
Harvard University	54600	111200	165800
The University of Texas Health Science Center at San Antonio	54200	80900	135100
University of Maryland-Baltimore	53800	84200	138000
Middlebury College	51900	57300	109200
Monterey Institute of International Studies	51900	57300	109200
Amherst College	49900	62800	112700
University of Pennsylvania	49000	92000	141000
Princeton University	47700	89700	137400
Thomas Jefferson University	41100	80100	121200
Loma Linda University	40000	68500	108500
University of Chicago	39700	78500	118200
Golden Gate University-San Francisco	39300	63300	102600
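Some rows in this table carry identical figures: the two Midwestern University campuses, and Middlebury College together with the Monterey Institute of International Studies. One possible explanation, which is a guess and not something the output confirms, is that earnings are reported jointly for related institutions. A sketch for flagging such rows, using four rows copied from the table:

```python
import pandas as pd

# Four rows copied from the results table.
top = pd.DataFrame({
    "INSTNM": [
        "Midwestern University-Downers Grove",
        "Midwestern University-Glendale",
        "Massachusetts Institute of Technology",
        "Stanford University",
    ],
    "diffp10": [58200, 58200, 58100, 56400],
    "mn_earn_wne_male0_p10": [111700, 111700, 93700, 94700],
    "mn_earn_wne_male1_p10": [169900, 169900, 151800, 151100],
})

# keep=False marks every member of a duplicated group, not just the repeats.
cols = ["mn_earn_wne_male0_p10", "mn_earn_wne_male1_p10"]
shared = top[top.duplicated(subset=cols, keep=False)]
print(shared["INSTNM"].tolist())
```

Run on the full frame, the same check would show how many of the ranked schools are really the same earnings cohort counted more than once.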