Every year, journalists and economists (USA Today, The Wall Street Journal, Forbes, etc.) rank the best undergraduate colleges in the USA. While the top colleges are associated with good employment outcomes, they are expensive, highly selective, and largely exclusive. For the majority of students in America, attending college is a means to increase economic mobility, that is, to earn more than someone with only a high school education. For many of these students, attending a top-tier Ivy League school is not an option because of tuition costs, family income, and SAT scores. The primary goal of this initiative is to develop a predictive model that can help low-income students decide where to go to college to increase their economic mobility.
In [1]:
import pandas as pd
import json
import urllib2
from pandas.io.json import json_normalize
import re
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
In [2]:
# I copied ~45 variables of interest from the data dictionary provided by the College Scorecard into a csv file
# https://collegescorecard.ed.gov/data/documentation/
Allvariables = pd.read_csv('allVariables.csv')
# convert df to a list of variables in the format required by the API query
#API query format: year.category.developer-friendly name
#adding a . to category
Allvariables['dev-category'] = Allvariables['dev-category'].astype(str) +'.'
#variables that are associated with the school and don't change with year
school_variables = Allvariables[Allvariables['dev-category'].isin(['school.'])]
#variables that change with year
year_variables = Allvariables[~Allvariables['dev-category'].isin(['school.'])]
#concatenating category and developer-friendly name and converting it to a list
year_var_name = list(year_variables[['dev-category','developer-friendly name']].astype(str).sum(axis=1))
school_var_name = list(school_variables[['dev-category','developer-friendly name']].astype(str).sum(axis=1))
#remove the root variable
year_var_name.remove('root.id')
year_var_name.remove('root.ope8_id')
year_var_name.remove('root.ope6_id')
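For reference, a minimal sketch of the query string the download loops below assemble from these variable names (the two field names here are illustrative examples, not necessarily rows of allVariables.csv):
example_vars = ['student.size', 'cost.tuition.in_state']  # illustrative dev-friendly names
year = 2013
query = str(year) + '.' + (',' + str(year) + '.').join(example_vars)
print(query)  # -> 2013.student.size,2013.cost.tuition.in_state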
In [3]:
# Reading the academic variables separately
# https://collegescorecard.ed.gov/data/documentation/
Academics = pd.read_csv('academics.csv')
# convert df to a list of variables in the format required by the API query
#API query format: year.category.developer-friendly name
#adding a . to category
Academics['dev-category'] = Academics['dev-category'].astype(str) +'.'
#concatenating category and developer-friendly name and converting it to a list
academic_var_name = list(Academics[['dev-category','developer-friendly name']].astype(str).sum(axis=1))
In [4]:
#Only download the data from the API if it has not already been downloaded
#once downloaded, the data is saved in csv format
if os.path.exists("year_data.csv"):
allyears_df = pd.read_csv("year_data.csv")
print("Year data already extracted from API")
else:
# for data that changes by year, loop through the years and download all the data
#number of years
years = range(2004,2016)
years.remove(2011)
#number of pages per year
n = int(7593/100)+1 #number of total entries per year / number of entries per page: 100 is the max.
#empty list of data frames for all the data
alldata =[]
for year in years:
#build the query from var_name
QueryAPI= str(year)+'.'+(','+str(year)+'.').join(year_var_name)
#create empty list to store all page dfs for a year
allpages =[]
print "Downloading year {}".format(year)
#loop to get all pages (n) for a year (API only allows a max of 100 entries per page)
for i in xrange(0, n):
try:
print "Downloading Page {}".format(i)
api ='https://api.data.gov/ed/collegescorecard/v1/schools?fields=id,{}&_per_page=100&page={}&api_key=7EfsgcFY3JfMxQJ7tAW0Cjqamb77iOHPbpdBXTP6'.format(QueryAPI, str(i))
req = urllib2.Request(api)
response = urllib2.urlopen(req)
response_read = response.read()
json_data=json.loads(response_read)
#store the results of each api call (pages) to a data frame
df = json_normalize(json_data, 'results', meta = 'metadata')
#removing the year from the column title so that we can concatenate columns across all years
replace = str(year)+'.'
df.columns=[re.sub(replace,'', s) for s in df.columns]
#make a list of dataframes
allpages.append(df)
except:
pass
#combine all df in the list to a single df per year
all_pages_df = pd.concat(allpages)
#add a year column
all_pages_df['year'] =year
#store all dfs per year in a list
alldata.append(all_pages_df)
#combine all df in the list to a single df for all years
allyears_df = pd.concat(alldata)
#write the df to a csv file
allyears_df.to_csv('year_data.csv')
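As a sanity check, a single page can also be fetched with the requests library. This is only a sketch: requests is not imported or used elsewhere in this notebook, and the field list shown here is illustrative.
import requests  # assumption: not part of this notebook's imports
api = 'https://api.data.gov/ed/collegescorecard/v1/schools'
params = {'fields': 'id,2013.student.size', '_per_page': 100, 'page': 0,
          'api_key': '7EfsgcFY3JfMxQJ7tAW0Cjqamb77iOHPbpdBXTP6'}
json_data = requests.get(api, params=params).json()
df = json_normalize(json_data, 'results', meta='metadata')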
In [5]:
allyears_df.dtypes
allyears_df = allyears_df.apply(pd.to_numeric, errors='coerce')
In [6]:
if os.path.exists("school_data.csv"):
allschools_df = pd.read_csv("school_data.csv")
print("School data already extracted from API")
else:
n = int(7593/100)+1 #number of total entries / number of entries per page: 100 is the max.
#empty list of data frames for all the data
alldata =[]
QueryAPI= (',').join(school_var_name)
#loop to get all pages (n) (API only allows a max of 100 entries per page)
for i in xrange(0, n):
try:
print "Downloading Page {}".format(i)
api ='https://api.data.gov/ed/collegescorecard/v1/schools?fields=school.name,id,ope8_id,ope6_id,school.region_id,location.lat,location.lon,{}&_per_page=100&page={}&api_key=7EfsgcFY3JfMxQJ7tAW0Cjqamb77iOHPbpdBXTP6'.format(QueryAPI, str(i))
req = urllib2.Request(api)
response = urllib2.urlopen(req)
response_read = response.read()
json_data=json.loads(response_read)
#store the results of each api call (pages) to a data frame
df = json_normalize(json_data, 'results', meta = 'metadata')
#school-level column names carry no year prefix, so no renaming is needed here
#make a list of dataframes
alldata.append(df)
except:
pass
#combine all df in the list to a single df for all years
allschools_df = pd.concat(alldata)
#write the df to a csv file
allschools_df.to_csv('school_data.csv')
In [7]:
#Only download the data from the API if it has not already been downloaded
#once downloaded, the data is saved in csv format
if os.path.exists("academic_data.csv"):
allyears_academic_df = pd.read_csv("academic_data.csv")
print("Academic data already extracted from API")
else:
# for data that changes by year, loop through the years and download all the data
#number of years
years = range(1997,2016)
years.remove(2007)
#years.remove(2013)
#number of pages per year
n = int(7593/100)+1 #number of total entries per year / number of entries per page: 100 is the max.
#empty list of data frames for all the data
alldata =[]
for year in years:
#build the query from var_name
QueryAPI= str(year)+'.'+(','+str(year)+'.').join(academic_var_name)
#create empty list to store all page dfs for a year
allpages =[]
print "Downloading year {}".format(year)
#loop to get all pages (n) for a year (API only allows a max of 100 entries per page)
for i in xrange(0, n):
try:
print "Downloading Page {}".format(i)
api ='https://api.data.gov/ed/collegescorecard/v1/schools?fields=school.name,id,{}&_per_page=100&page={}&api_key=7EfsgcFY3JfMxQJ7tAW0Cjqamb77iOHPbpdBXTP6'.format(QueryAPI, str(i))
req = urllib2.Request(api)
response = urllib2.urlopen(req)
response_read = response.read()
json_data=json.loads(response_read)
#store the results of each api call (pages) to a data frame
df = json_normalize(json_data, 'results', meta = 'metadata')
#removing the year from the column title so that we can concatenate columns across all years
replace = str(year)+'.'
df.columns=[re.sub(replace,'', s) for s in df.columns]
#make a list of dataframes
allpages.append(df)
except:
pass
try:
#combine all df in the list to a single df per year
all_pages_df = pd.concat(allpages)
#add a year column
all_pages_df['year'] =year
#store all dfs per year in a list
alldata.append(all_pages_df)
except:
pass
#combine all df in the list to a single df for all years
allyears_academic_df = pd.concat(alldata)
#write the df to a csv file
allyears_academic_df.to_csv('academic_data.csv')
In [8]:
### download state data
n = int(7593/100)+1
if os.path.exists("states.csv"):
state_df = pd.read_csv("states.csv")
print("State data already extracted from API")
else:
alldata=[]
for i in xrange(0, n):
try:
print "Downloading Page {}".format(i)
api ='https://api.data.gov/ed/collegescorecard/v1/schools?fields=school.name,id,school.state&_per_page=100&page={}&api_key=7EfsgcFY3JfMxQJ7tAW0Cjqamb77iOHPbpdBXTP6'.format(str(i))
req = urllib2.Request(api)
response = urllib2.urlopen(req)
response_read = response.read()
json_data=json.loads(response_read)
#store the results of each api call (pages) to a data frame
df = json_normalize(json_data, 'results', meta = 'metadata')
#make a list of dataframes
alldata.append(df)
except:
pass
# combine all df in the list to a single df for all years
state_df = pd.concat(alldata)
state_df.to_csv('states.csv')
In [9]:
%matplotlib inline
missing = allyears_df.groupby('year').count()
missing = missing/7593 #dividing it by the total to get a proportion.
plt.figure(figsize=(17,7))
plt.xticks(fontsize=10)
plt.yticks(fontsize=15)
plt.title("Missing student data from 1997 to 2015", fontsize=20, ha='center', x=0.5, y=1.04)
plt.suptitle('lighter hues (close to zero) indicate a higher proportion of missing data', fontsize=12, ha='center', x=0.43, y=0.91)
sns.heatmap(missing)
plt.show()
In [10]:
%matplotlib inline
missing = allyears_academic_df.groupby('year').count()
missing = missing/7593 #dividing it by the total to get a proportion.
plt.figure(figsize=(15,6))
plt.xticks(fontsize=12)
plt.yticks(fontsize=15)
plt.title("Missing academic data from 2008 to 2015", fontsize=20, ha='center', x=0.5, y=1.04)
plt.suptitle('lighter hues (close to zero) indicate a higher proportion of missing data', fontsize=12, ha='center', x=0.43, y=0.91)
sns.heatmap(missing)
plt.show()
In [11]:
#replace blank values (coded as -3) with NAs
allschools_df=allschools_df.replace(-3.0, np.NaN)
In [12]:
#All these classifications are coded as integers, with the associated category labels provided in the data dictionary.
#Here I map the integers to their labels using the dictionary (imported as a csv)
## Carnegie school type classification
## Using the data dictionary
carnegie = pd.read_csv('Carnegie_notations.csv', header=None)
carnegie_dict =dict(carnegie.values)
allschools_df["school.carnegie_basic"].replace(carnegie_dict, inplace=True)
#print(allschools_df["school.carnegie_basic"].value_counts(dropna=False))
# almost a third of the colleges are not classified
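Each mapping CSV is expected to hold two columns, an integer code and its label, which dict() turns into a lookup for replace(). The codes and labels below are made up for illustration and are not taken from Carnegie_notations.csv.
example = pd.DataFrame([[1, 'Type A'], [2, 'Type B']])  # hypothetical code/label pairs
example_dict = dict(example.values)                      # {1: 'Type A', 2: 'Type B'}
pd.Series([1, 2, 1]).replace(example_dict)               # -> 'Type A', 'Type B', 'Type A'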
In [13]:
## Carnegie undergrad classification
carnegie_ug = pd.read_csv('Carnegie_undergrad.csv', header=None)
carnegie_ug_dict =dict(carnegie_ug.values)
allschools_df["school.carnegie_undergrad"].replace(carnegie_ug_dict, inplace=True)
#print(allschools_df["school.carnegie_undergrad"].value_counts(dropna=False))
# almost a third of the colleges are not classified
#remove all the graduate/professional-only colleges and two-year colleges
In [14]:
## Carnegie size classification
carnegie_sz = pd.read_csv('Carnegie_size.csv', header=None)
carnegie_sz_dict =dict(carnegie_sz.values)
allschools_df["school.carnegie_size_setting"].replace(carnegie_sz_dict, inplace=True)
#print(allschools_df["school.carnegie_size_setting"].value_counts(dropna=False))
# almost a third of the colleges are not classified
In [15]:
## regional classification
region = pd.read_csv('regionID.csv', header=None)
region_dict =dict(region.values)
allschools_df["school.region_id"].replace(region_dict, inplace=True)
#print(allschools_df["school.region_id"].value_counts(dropna=False))
In [16]:
# this column does not exist in the data, so remove it from the df
allschools_df.drop('school.degree_urbanization', axis=1, inplace=True)
In [17]:
## Locale classification
locale = pd.read_csv('urbanization.csv', header=None)
locale_dict =dict(locale.values)
allschools_df["school.locale"].replace(locale_dict, inplace=True)
#print(allschools_df["school.locale"].value_counts(dropna=False))
# almost a third of the colleges are not classified
In [18]:
### Merging state data with school data -- downloaded separately
allschools_df_state=pd.merge(allschools_df, state_df[['id','school.state']], left_on='id', right_on='id', how ='inner')
In [19]:
### Merging school data with mobility data from the Equality of Opportunity Project
mobility_EOP = pd.read_csv("http://www.equality-of-opportunity.org/data/college/mrc_table1.csv")
#combining dataframes on super_opeid -- neither school names nor ids are consistent between the two dataframes
allschools_EOP_df= pd.merge(allschools_df_state, mobility_EOP, left_on='super_opeid', right_on='super_opeid')
In [20]:
#merge all_schools with the 2013 data for schools
# 2013 most complete dataset
df_2013 = allyears_df[(allyears_df['year']==2013)]
#impute missing census data in 2013 from 2005
df_2005 = allyears_df[(allyears_df['year']==2005)]
#impute census columns
impute_cols = ['student.demographics.share_white.home_ZIP', \
'student.demographics.share_black.home_ZIP', \
'student.demographics.share_asian.home_ZIP', \
'student.demographics.share_hispanic.home_ZIP', \
'student.demographics.share_bachelors_degree_age25.home_ZIP', \
'student.demographics.share_professional_degree_age25.home_ZIP', \
'student.demographics.share_born_US.home_ZIP', \
'student.demographics.poverty_rate', \
'student.demographics.unemployment', \
'student.demographics.median_hh_income_log']
# selecting the census columns from 2005
df_2005_census = df_2005[impute_cols]
df_2005_census['id'] = df_2005['id']
# removing the census columns from 2013
df_2013.drop(impute_cols, axis=1, inplace=True)
# merging the 2013 dataset with the 2005 census data on id
df_2013_imputed = pd.merge(df_2013, df_2005_census, left_on='id', right_on='id')
In [21]:
#main dataset
df_2013_full = pd.merge(df_2013_imputed, allschools_EOP_df, left_on='id', right_on='id')
In [22]:
colnames = pd.read_csv('colnames.csv')
colnames_dict = dict(zip(colnames['old'], colnames['new']))
df_2013_full.rename(columns=colnames_dict, inplace=True)
In [23]:
#minority_serving
df_2013_full['minority_serving']= \
df_2013_full['minority_serving_hispanic']+ \
df_2013_full['minority_serving_predominantly_black']+ \
df_2013_full['minority_serving_historically_black']+ \
df_2013_full['minority_serving_tribal']+ \
df_2013_full['minority_serving_aanipi']+ \
df_2013_full['minority_serving_annh']
#df_2013_full['minority_serving'][df_2013_full.minority_serving>1]=1
# % underrepresented minorities
df_2013_full['underrep_minorities'] = \
df_2013_full['prop_black'] +\
df_2013_full['prop_hispanic'] + \
df_2013_full['prop_pacific_islander'] + \
df_2013_full['prop_native_american']
#debt-to-earnings ratio: median debt of completers relative to median earnings 10 years after entry
df_2013_full['debt_income_ratio'] = \
df_2013_full['median_debt_completers_overall']/df_2013_full['median_earnings_in_10_yrs']
# % underrepresented minorities in the neighbourhood
df_2013_full['underrep_minorities_census'] = \
df_2013_full['prop_black_homes_in_neighborhood'] +\
df_2013_full['prop_hispanic_homes_in_neighborhood']
In [24]:
data = df_2013_full[['prop_bottom_income',
'prop_top_1percent_income',
'prop_bottom_income_to_top',
'prop_bottom_income_to_top_1percent',
'mobility_rate']]
data_sqrt=np.sqrt(data)
g = sns.PairGrid(data)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
plt.show()
In [25]:
data = df_2013_full[['debt_income_ratio',
'tuition_in_state',
'cost_per_academic_year',
'median_earnings_in_10_yrs',
'number_unemployed',
'loan_principal',
'median_debt_completers_overall',
'retention_rate_four_year_full_time',
'avg_family_income']].dropna()
g = sns.PairGrid(data)
g = g.map_diag(plt.hist, edgecolor="w")
g = g.map_offdiag(plt.scatter, edgecolor="w", s=40)
#data['debt_income_ratio']
In [26]:
data['number_unemployed'].hist()
plt.xlabel('number unemployed postgraduation')
plt.ylabel('frequency')
plt.show()
In [27]:
df_2013_full[df_2013_full.number_unemployed >4000][['school_name','number_unemployed']].head(10)
Out[27]:
In [28]:
data = df_2013_full[['underrep_minorities',
'prop_international_student',
'prop_part_time',
'prop_grad_students',
'faculty_salary',
'prop_fulltime_faculty',
'pell_grant_rate',
'federal_loan_rate',
'prop_firstgeneration',
'prop_firstgeneration_parents_highschool',
]].dropna()
g = sns.PairGrid(data)
g = g.map_diag(plt.hist, edgecolor="w")
g = g.map_offdiag(plt.scatter, edgecolor="w", s=40)
In [29]:
## some values appear to be close to 0, but it doesn't look like there are any salaries below $1,000;
## these are probably real values and not mistakes in the data
df_2013_full[df_2013_full.faculty_salary <2000][['name_broad','faculty_salary']].head(10)
Out[29]:
In [30]:
data = df_2013_full[['prop_bachelors_degree_homes_in_neighborhood',
'poverty_rate_in_neighborhood',
'unemployment_in_neighborhood',
'median_income_in_neighborhood'
]].dropna()
g = sns.PairGrid(data)
g = g.map_diag(plt.hist, edgecolor="w")
g = g.map_offdiag(plt.scatter, edgecolor="w", s=40)
In [86]:
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.lmplot(x="avg_family_income", y="sat_scores_average_overall", data=df_2013_full, size =8)
plt.title('Average family income of attending students \n and college selectivity (SAT scores)', fontsize = 24)
plt.xlabel('Average family income', fontsize = 16)
plt.ylabel('Average SAT scores (overall)', fontsize = 16)
plt.show()
What is remarkable, albeit not completely unexpected, is that schools that are more selective (those requiring higher SAT scores) generally admit a wealthier student body. Data on family income and SAT scores show that kids from poorer families tend to do worse on the SATs (https://economix.blogs.nytimes.com/2009/08/27/sat-scores-and-family-income/). What this means for upward mobility is that top-tier, selective colleges are not well suited to increasing the upward mobility of students from lower-income families.
In [33]:
## test for linear relationship
data = df_2013_full[['avg_family_income', 'sat_scores_average_overall']].dropna()
predictors = data['avg_family_income']
response = data['sat_scores_average_overall']
#predictors = sm.add_constant(predictors)
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
In [34]:
## test for linear relationship math
data = df_2013_full[['avg_family_income', 'sat_scores_midpoint_math']].dropna()
predictors = data['avg_family_income']
response = data['sat_scores_midpoint_math']
#predictors = sm.add_constant(predictors)
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
## 380.9 + 0.002*family income
In [35]:
## test for linear relationship writing
data = df_2013_full[['avg_family_income', 'sat_scores_midpoint_writing']].dropna()
predictors = data['avg_family_income']
response = data['sat_scores_midpoint_writing']
#predictors = sm.add_constant(predictors)
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
## 348 + 0.0022*family income
In [36]:
## test for linear relationship - reading
data = df_2013_full[['avg_family_income', 'sat_scores_midpoint_critical_reading']].dropna()
predictors = data['avg_family_income']
response = data['sat_scores_midpoint_critical_reading']
#predictors = sm.add_constant(predictors)
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
## 373.85 + 0.002*family income
In [37]:
%matplotlib inline
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.lmplot(x="sat_scores_average_overall", y="mobility_rate", data=df_2013_full, size =8)
plt.title('College selectivity (SAT scores) and mobility rate', fontsize = 25)
plt.xlabel('Average SAT scores (overall)', fontsize = 16)
plt.ylabel('Mobility rate', fontsize = 16)
plt.show()
In [38]:
## test for linear relationships between these variables
data = df_2013_full[['sat_scores_average_overall', 'mobility_rate']].dropna()
response = np.sqrt(data['mobility_rate'])
predictors = data['sat_scores_average_overall']
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
In [39]:
%matplotlib inline
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.lmplot(x="sat_scores_average_overall", y="prop_bottom_income", data=df_2013_full, size =8)
plt.title('College selectivity (SAT scores) and \n percent of low-income students', fontsize = 20)
plt.xlabel('Average SAT scores (overall)', fontsize = 16)
plt.ylabel('Percent of low-income students', fontsize = 16)
plt.show()
In [40]:
%matplotlib inline
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.lmplot(x="sat_scores_average_overall", y="prop_bottom_income_to_top", data=df_2013_full, size =8)
plt.title('College selectivity (SAT scores) and \n upward mobility for admitted low-income students', fontsize = 20)
plt.xlabel('Average SAT scores (overall)', fontsize = 16)
plt.ylabel('Percent of low-income students achieving upward mobility', fontsize = 16)
plt.show()
In [41]:
data = df_2013_full[['sat_scores_average_overall', 'prop_bottom_income_to_top']].dropna()
response = np.sqrt(data['prop_bottom_income_to_top'])
predictors = data['sat_scores_average_overall']
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
In [42]:
data = df_2013_full[['sat_scores_average_overall',
'prop_bottom_income',
'prop_top_1percent_income',
'prop_bottom_income_to_top',
'prop_bottom_income_to_top_1percent',
'mobility_rate',
'mobility_rate_to_top_1percent',
'debt_income_ratio',
'cost_per_academic_year',
'median_earnings_in_10_yrs',
'number_unemployed',
'loan_principal',
'median_debt_completers_overall',
'retention_rate_four_year_full_time',
'avg_family_income',
'underrep_minorities',
'prop_international_student',
'prop_part_time',
'prop_grad_students',
'faculty_salary',
'prop_fulltime_faculty',
'pell_grant_rate',
'federal_loan_rate',
'prop_firstgeneration',
'prop_firstgeneration_parents_highschool',
'prop_white_homes_in_neighborhood',
'underrep_minorities_census',
'prop_bachelors_degree_homes_in_neighborhood',
'poverty_rate_in_neighborhood',
'unemployment_in_neighborhood',
'median_income_in_neighborhood'
]]
corr = data.corr()
corr.sort_values('sat_scores_average_overall', axis =1,inplace = True)
corr.sort_values('sat_scores_average_overall', axis =0, inplace = True, ascending = False)
In [43]:
## correlation with just mobility
plot_data = corr['mobility_rate'].sort_values()
plot_data.drop(plot_data.tail(1).index,inplace=True)
plot_data.plot(kind='barh', figsize=(20, 15))
plt.title('Factors that correlate with mobility rate', fontsize = 40)
plt.xlabel('Correlation with mobility rate', fontsize = 25)
plt.ylabel('Institutional characteristics', fontsize = 25)
Out[43]:
Upward mobility is related to the number of low-income students admitted, to diversity at the institution, and to where the institution is located. Interestingly, it is also associated with colleges that have more graduate students, more international students, and more full-time faculty who are paid more (i.e., likely bigger universities).
In [44]:
#correlation heatmap
plt.figure(figsize=(20,15))
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title("")
plt.title('Correlation heatmap with all institutional characteristics', fontsize = 30)
sns.heatmap(corr)
plt.show()
The heat map depicts two data quadrants representing essentially two types of educational institutions: (1) those that are selective (high average SAT scores), which enroll students from top income brackets, charge high tuition, and sit in affluent, predominantly white neighbourhoods with low diversity, and (2) colleges with more minorities, more first-generation students, lower family incomes, and locations in lower-income neighbourhoods.
Mobility is high for low-income students who do make it to top selective and expensive universities (prop_bottom_income_to_top). However, these colleges take in very few students from the bottom income brackets (prop_bottom_income). Thus, elite universities are doing a poor job of facilitating upward mobility. These universities don't contribute to the American dream.
In [45]:
## test for linear relationships between these variables
data = df_2013_full[['mobility_rate',
'debt_income_ratio',
'median_earnings_in_10_yrs',
'loan_principal',
'prop_firstgeneration',
'underrep_minorities',
'underrep_minorities_census',
'poverty_rate_in_neighborhood',
'prop_white_homes_in_neighborhood',
'prop_international_student',
'prop_grad_students',
'faculty_salary'
]].dropna()
response = np.sqrt(data['mobility_rate'])
predictors = data[['debt_income_ratio',
'median_earnings_in_10_yrs',
'loan_principal',
'prop_firstgeneration',
'underrep_minorities',
'underrep_minorities_census',
'poverty_rate_in_neighborhood',
'prop_white_homes_in_neighborhood',
'prop_international_student',
'prop_grad_students',
'faculty_salary']]
mod = sm.OLS(response, predictors)
res = mod.fit()
print(res.summary())
In [46]:
df_2013_full.sort_values('debt_income_ratio', ascending =False)[['name_broad','debt_income_ratio']].drop_duplicates().head(10)
Out[46]:
In [47]:
df_2013_full.sort_values('median_earnings_in_10_yrs', ascending =False)[['name_broad','median_earnings_in_10_yrs']].drop_duplicates().head(10)
Out[47]:
In [48]:
# of the low income students, the proportion that make the top 20% (best bet)
df_2013_full.sort_values('prop_bottom_income_to_top', ascending =False)[['name_broad','prop_bottom_income_to_top']].drop_duplicates().head(10)
Out[48]:
In [49]:
# mobility from bottom 20 to top 20 %
df_2013_full.sort_values('mobility_rate', ascending =False)[['name_broad','mobility_rate']].drop_duplicates().head(10)
Out[49]:
In [50]:
# mobility from bottom 20 to top 1 %
df_2013_full.sort_values('mobility_rate_to_top_1percent', ascending =False)[['name_broad','mobility_rate_to_top_1percent']].drop_duplicates().head(10)
Out[50]:
In [51]:
df_2013_full[['carnegie_basic','mobility_rate']].groupby('carnegie_basic').median(). \
sort_values('mobility_rate').plot(kind='barh', figsize=(10, 20))
plt.title('Mobility rate and college type', fontsize = 25)
plt.xlabel('Mobility rate', fontsize = 16)
plt.ylabel("")
Out[51]:
In [52]:
df_2013_full[['carnegie_size_setting','mobility_rate']].groupby('carnegie_size_setting').median(). \
sort_values('mobility_rate').plot(kind='barh', figsize=(10, 10))
plt.title('Mobility rate and college type', fontsize = 25)
plt.xlabel('Mobility rate', fontsize = 16)
plt.ylabel("")
Out[52]:
In [53]:
df_2013_full[['carnegie_undergrad','mobility_rate']].groupby('carnegie_undergrad').median(). \
sort_values('mobility_rate').plot(kind='barh', figsize=(10, 8))
plt.title('Mobility rate and college type', fontsize = 25)
plt.xlabel('Mobility rate', fontsize = 16)
plt.ylabel("")
Out[53]:
In [54]:
df_2013_full[['locale','mobility_rate']].groupby('locale').median(). \
sort_values('mobility_rate').plot(kind='barh', figsize=(10, 6))
Out[54]:
In [55]:
data=df_2013_full[['mobility_rate','minority_serving']].dropna()
MSIdict={0:'Not MSI', 1:'MSI 1 gr', 2:'MSI >1 gr'}
data['minority_serving'].replace(MSIdict, inplace = True)
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.boxplot(y="mobility_rate", x="minority_serving", data=data)
plt.xticks(rotation=45)
plt.title('Mobility rate and minority-serving institutions \n for single or multiple minorities (MSI)', fontsize = 25)
plt.ylabel('Mobility rate', fontsize = 16)
plt.xlabel("")
plt.show()
In [56]:
import statsmodels.formula.api as smf
mod= smf.ols('mobility_rate ~C(minority_serving)', data=data) #Specify C for Categorical
res = mod.fit()
print(res.summary())
In [57]:
MSIdict={0:'Not MSI', 1:'MSI 1 gr', 2:'MSI >1 gr'}
df_2013_full['minority_serving'].replace(MSIdict, inplace = True)
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.lmplot(x="avg_family_income", y="median_earnings_in_10_yrs",
data=df_2013_full, hue="minority_serving",fit_reg=False, size =8)
plt.show()
In [58]:
data=df_2013_full[['mobility_rate','institutional_characteristics_level']].dropna()
Instdict={1:'4-year', 2:'2-year', 3:'less than 2-year'}
data['institutional_characteristics_level'].replace(Instdict, inplace = True)
sns.set(style="ticks", color_codes=True, font_scale=2 )
sns.boxplot(y="mobility_rate", x="institutional_characteristics_level", data=data)
plt.title('Mobility rate and college type', fontsize = 25)
plt.ylabel('Mobility rate', fontsize = 16)
plt.xlabel("")
plt.show()
In [59]:
mod= smf.ols('mobility_rate ~C(institutional_characteristics_level)', data=data) #Specify C for Categorical
res = mod.fit()
print(res.summary())
In [60]:
sat_cat=('low','medium','high')
#sat_cat=('ignore','low1','low2','medium','high')
df_2013_full['sat_scores_bins'] =pd.cut(df_2013_full['sat_scores_average_overall'],3, labels =sat_cat)
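pd.cut with an integer bin count splits the observed SAT range into three equal-width intervals; a toy illustration with made-up scores:
scores = pd.Series([800, 1000, 1200, 1400, 1600])    # made-up values
pd.cut(scores, 3, labels=('low', 'medium', 'high'))  # 800/1000 -> low, 1200 -> medium, 1400/1600 -> high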
In [61]:
df_2013_full[(df_2013_full['sat_scores_bins'] =='low')]. \
sort_values('prop_bottom_income_to_top', ascending=False) \
[['name_broad','prop_bottom_income_to_top','median_earnings_in_10_yrs','sat_scores_average_overall']].head(10)
Out[61]:
In [62]:
df_2013_full[(df_2013_full['sat_scores_bins'] =='medium')]. \
sort_values('prop_bottom_income_to_top', ascending=False) \
[['name_broad','prop_bottom_income_to_top','median_earnings_in_10_yrs','sat_scores_average_overall']].head(10)
Out[62]:
In [63]:
df_2013_full[(df_2013_full['sat_scores_bins'] =='high')]. \
sort_values('prop_bottom_income_to_top', ascending=False) \
[['name_broad','prop_bottom_income_to_top','median_earnings_in_10_yrs','sat_scores_average_overall']].head(10)
Out[63]:
In [64]:
mean_mobility_state = df_2013_full[["state","mobility_rate"]].groupby("state").median()
mean_mobility_state.reset_index(inplace=True)
mean_cost_state = df_2013_full[["state","cost_per_academic_year"]].groupby("state").median()
mean_cost_state.reset_index(inplace=True)
In [65]:
import plotly
plotly.tools.set_credentials_file(username='tjogesh', api_key='jQCndI6d5M1oYF4mXX7i')
In [66]:
scl = [[0.0, 'rgb(0, 51, 204)'],[0.2, 'rgb(51, 153, 255)'],[0.4, 'rgb(204, 204, 255)'],\
[0.6, 'rgb(255, 204, 204)'],[0.8, 'rgb(255, 102, 0)'],[1.0, 'rgb(153, 0, 0)']]
data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = mean_mobility_state['state'],
z = mean_mobility_state['mobility_rate'].astype(float),
locationmode = 'USA-states',
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = "Mobility ranking")
) ]
layout = dict(
title = 'Average college mobility rate by state',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
plotly.plotly.iplot( fig, filename='mobility-cloropleth-map' )
Out[66]:
In [67]:
scl = [[0.0, 'rgb(0, 51, 204)'],[0.2, 'rgb(51, 153, 255)'],[0.4, 'rgb(204, 204, 255)'],\
[0.6, 'rgb(255, 204, 204)'],[0.8, 'rgb(255, 102, 0)'],[1.0, 'rgb(153, 0, 0)']]
data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = mean_cost_state['state'],
z = mean_cost_state['cost_per_academic_year'].astype(float),
locationmode = 'USA-states',
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = "Average cost per year")
) ]
layout = dict(
title = 'Average cost of attending college by state',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
plotly.plotly.iplot( fig, filename='d3-cloropleth-map' )
Out[67]:
In [68]:
academic_df_2013 = allyears_academic_df[(allyears_academic_df['year']==2013)]
df_2013_full_academics = pd.merge(academic_df_2013, df_2013_full, left_on='id', right_on='id')
#academic_df_2013.head()
In [69]:
### correlation heatmap to look for important relationships.
data = df_2013_full_academics[['mobility_rate',
'academics.program_percentage.agriculture',
'academics.program_percentage.architecture',
'academics.program_percentage.biological',
'academics.program_percentage.business_marketing',
'academics.program_percentage.communication',
'academics.program_percentage.communications_technology',
'academics.program_percentage.computer',
'academics.program_percentage.construction',
'academics.program_percentage.education',
'academics.program_percentage.engineering',
'academics.program_percentage.engineering_technology',
'academics.program_percentage.english',
'academics.program_percentage.ethnic_cultural_gender',
'academics.program_percentage.family_consumer_science',
'academics.program_percentage.health',
'academics.program_percentage.history',
'academics.program_percentage.humanities',
'academics.program_percentage.language',
'academics.program_percentage.legal',
'academics.program_percentage.library',
'academics.program_percentage.mathematics',
'academics.program_percentage.mechanic_repair_technology',
'academics.program_percentage.military',
'academics.program_percentage.multidiscipline',
'academics.program_percentage.parks_recreation_fitness',
'academics.program_percentage.personal_culinary',
'academics.program_percentage.philosophy_religious',
'academics.program_percentage.physical_science',
'academics.program_percentage.precision_production',
'academics.program_percentage.psychology',
'academics.program_percentage.public_administration_social_service',
'academics.program_percentage.resources',
'academics.program_percentage.science_technology',
'academics.program_percentage.security_law_enforcement',
'academics.program_percentage.social_science',
'academics.program_percentage.theology_religious_vocation',
'academics.program_percentage.transportation',
'academics.program_percentage.visual_performing'
]]
replace = "academics.program_percentage."
data.columns= [re.sub(replace,'', s) for s in data.columns]
corr = data.corr()
#corr.sort_values('mobility_rate', axis =0, inplace = True, ascending = False)
plot_data = corr['mobility_rate'].sort_values()
plot_data.drop(plot_data.tail(1).index,inplace=True)
plot_data.plot(kind='barh', figsize=(20, 20))
#plt.figure(figsize=(20,15))
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Factors that correlate with mobility rate', fontsize = 40)
plt.xlabel('Correlation with mobility rate', fontsize = 25)
plt.ylabel('Academic focus', fontsize = 25)
plt.show()
In [70]:
#handle missing data by imputing medians
from sklearn.preprocessing import Imputer
impute = Imputer(strategy='median')
In [71]:
### Converting academic proportions to presence/absence data, i.e., does the school offer the program
list_drop = ['Unnamed: 0','id','school.name','metadata','year']
academic_programs = academic_df_2013.drop(list_drop,axis = 1)
#s[s>1] = 0
academic_programs[academic_programs>0]=1
academic_programs.head()
academic_programs['id'] = academic_df_2013['id']
student_academics = pd.merge(academic_programs, df_2013_full, left_on='id', right_on='id')
#list(academic_programs.columns)
In [72]:
## simplifying carnegie_undergrad to two classes: 2-year vs 4-year institutions
student_academics['undergrad_length']=student_academics['carnegie_undergrad'].str.rsplit(',', expand=True)[0]
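A small sketch of what rsplit(',', expand=True)[0] does here, keeping only the text before the first comma; the label strings below are hypothetical, not the exact wording of the Carnegie categories:
s = pd.Series(['Four-year, medium full-time', 'Two-year, higher part-time'])  # hypothetical labels
s.str.rsplit(',', expand=True)[0]   # -> 'Four-year', 'Two-year'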
In [73]:
cluster_features = student_academics[['school_name','state',
'academics.program_percentage.agriculture',
'academics.program_percentage.architecture',
'academics.program_percentage.biological',
'academics.program_percentage.business_marketing',
'academics.program_percentage.communication',
'academics.program_percentage.communications_technology',
'academics.program_percentage.computer',
'academics.program_percentage.construction',
'academics.program_percentage.education',
'academics.program_percentage.engineering',
'academics.program_percentage.engineering_technology',
'academics.program_percentage.english',
'academics.program_percentage.ethnic_cultural_gender',
'academics.program_percentage.family_consumer_science',
'academics.program_percentage.health',
'academics.program_percentage.history',
'academics.program_percentage.humanities',
'academics.program_percentage.language',
'academics.program_percentage.legal',
'academics.program_percentage.library',
'academics.program_percentage.mathematics',
'academics.program_percentage.mechanic_repair_technology',
'academics.program_percentage.military',
'academics.program_percentage.multidiscipline',
'academics.program_percentage.parks_recreation_fitness',
'academics.program_percentage.personal_culinary',
'academics.program_percentage.philosophy_religious',
'academics.program_percentage.physical_science',
'academics.program_percentage.precision_production',
'academics.program_percentage.psychology',
'academics.program_percentage.public_administration_social_service',
'academics.program_percentage.resources',
'academics.program_percentage.science_technology',
'academics.program_percentage.security_law_enforcement',
'academics.program_percentage.social_science',
'academics.program_percentage.theology_religious_vocation',
'academics.program_percentage.transportation',
'academics.program_percentage.visual_performing',
'sat_scores_midpoint_critical_reading',
'sat_scores_midpoint_math',
'sat_scores_midpoint_writing',
'cost_per_academic_year',
'prop_bottom_income_to_top',
'avg_family_income',
'median_earnings_in_10_yrs',
#'mobility_rate',
'undergrad_length'
]]
In [74]:
cluster_features.loc[cluster_features['sat_scores_midpoint_critical_reading'].isnull(), \
'sat_scores_midpoint_critical_reading'] = cluster_features['avg_family_income']*0.0066
cluster_features.loc[cluster_features['sat_scores_midpoint_math'].isnull(), \
'sat_scores_midpoint_math'] = cluster_features['avg_family_income']*0.0068
cluster_features.loc[cluster_features['sat_scores_midpoint_writing'].isnull(), \
'sat_scores_midpoint_writing'] = cluster_features['avg_family_income']*0.0064
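A hedged sketch of where slope factors like 0.0066 could come from: the earlier no-intercept OLS fits of SAT midpoints on average family income. The exact values depend on the data actually returned by the API, so the multipliers above are taken as given rather than recomputed here.
# illustrative only: recover the no-intercept slope for the math midpoint
tmp = df_2013_full[['avg_family_income', 'sat_scores_midpoint_math']].dropna()
slope_math = sm.OLS(tmp['sat_scores_midpoint_math'], tmp['avg_family_income']).fit().params[0]
print(slope_math)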
In [75]:
#create unique list of states
location = cluster_features.state.unique()
#create a data frame dictionary to store your data frames
localDict = {loc : cluster_features for loc in location}
for key in localDict.keys():
localDict[key] = cluster_features[:][cluster_features.state == key]
#localDict.pop('Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)',None) # only one sample
#remove states with less than 10 schools
[localDict.pop(key) for key in ['AK','PR']]
localDict.keys()
Out[75]:
In [76]:
cols_to_transform = ['undergrad_length','state']
cols_to_drop = ['school_name','median_earnings_in_10_yrs','avg_family_income','prop_bottom_income_to_top']
data = {}
dfs= {}
for key in localDict.keys():
df = pd.get_dummies(localDict[key], columns = cols_to_transform )
X = df.drop(cols_to_drop, axis =1).values
X_new = impute.fit_transform(X)
data.update({key:X_new})
dfs.update({key:df})
#print key, df.shape, X_new.shape
In [77]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#fit a random forest to predict the cluster labels and report train/test accuracy
def cluster_test(x, cluster):
xtrain, xtest, ytrain, ytest = train_test_split(x,cluster, test_size=0.30)
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(xtrain, ytrain)
training_accuracy = forest.score(xtrain, ytrain)
test_accuracy = forest.score(xtest, ytest)
print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data: {:2f}".format(test_accuracy))
return forest, test_accuracy
In [78]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
labels = {}
student_data_model = {}
model_accuracy = {}
cluster_number = {}
for key in data.keys():
if len(data[key])< 10: #fewer than 10 schools in the state
model = KMeans(n_clusters=2) #use a minimum of 2 clusters
cluster_number.update({key:2})
label=model.fit_predict(data[key])
dfs[key]['cluster'] = label
print("State: {}, No. of clusters: 2".format(key))
forest, accuracy = cluster_test(data[key], label)
student_data_model.update({key:forest})
model_accuracy.update({key:accuracy})
elif len(data[key])> 40:
model = KMeans(n_clusters=len(data[key])/20) #~20 schools per cluster
cluster_number.update({key:len(data[key])/20})
label=model.fit_predict(data[key])
dfs[key]['cluster'] = label
print("State: {}, No. of clusters: {}".format(key, len(data[key])/20))
forest, accuracy = cluster_test(data[key], label)
student_data_model.update({key:forest})
model_accuracy.update({key:accuracy})
else:
model = KMeans(n_clusters=len(data[key])/5)# 10 to 40 schools in the state: ~5 schools per cluster
cluster_number.update({key:len(data[key])/5})
label=model.fit_predict(data[key])
dfs[key]['cluster'] = label
print("State: {}, No. of clusters: {}".format(key, len(data[key])/5))
forest, accuracy = cluster_test(data[key], label)
student_data_model.update({key:forest})
model_accuracy.update({key:accuracy})
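To eyeball the result of the clustering, one state's assignments can be listed directly ('CA' is used only as an example key; any state remaining in dfs works):
dfs['CA'][['school_name', 'cluster']].head()   # schools and their assigned cluster for one state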
In [79]:
from collections import OrderedDict
model_accuracy = OrderedDict(sorted(model_accuracy.items(), key=lambda(k,v):(v,k)))
plt.bar(range(len(model_accuracy)), model_accuracy.values(), align='center')
plt.xticks(range(len(model_accuracy)), list(model_accuracy.keys()), fontsize = 11, rotation='vertical')
plt.axhline(0.7, color="gray")
plt.title('Random forest classifier accuracy', fontsize = 25)
plt.ylabel("Testing accuracy", fontsize = 16)
plt.show()
In [80]:
import re
from ipywidgets import widgets
from IPython.display import display
data_to_enter = list(dfs['CA'].columns)
style = {'description_width': 'initial'}
### state
states = localDict.keys()
states_dropdown = widgets.Dropdown(options=states, value = None, description ="State of residence", style = style)
### programs
programs = data_to_enter[1:39]
replace = "academics.program_percentage."
programs= [re.sub(replace,'', s) for s in programs]
program_dropdown1 = widgets.Dropdown(options=programs, value = None, description ="Major choice 1", style = style)
program_dropdown2 = widgets.Dropdown(options=programs, value = None, description ="Major choice 2", style = style)
program_dropdown3 = widgets.Dropdown(options=programs, value = None, description ="Major choice 3", style = style)
### institution type
inst = data_to_enter[46:50]
replace = "undergrad_length_"
inst= [re.sub(replace,'', s) for s in inst]
inst_dropdown = widgets.Dropdown(options=inst, value = None, description ="Institution Type", style = style)
### SAT scores
sat_math_slider = widgets.IntSlider(min=cluster_features.sat_scores_midpoint_math.min(), \
max=cluster_features.sat_scores_midpoint_math.max(), \
step=20, description ="SAT score math", \
style=style)
sat_writing_slider = widgets.IntSlider(min=cluster_features.sat_scores_midpoint_writing.min(), \
max=cluster_features.sat_scores_midpoint_writing.max(), \
step=20, description ="SAT score writing", \
style=style)
sat_read_slider = widgets.FloatSlider(min=cluster_features.sat_scores_midpoint_critical_reading.min(), \
max=cluster_features.sat_scores_midpoint_critical_reading.max(), \
step=20, description ="SAT score reading", \
style=style, \
continuous_update = False)
cost_slider = widgets.IntSlider(min=cluster_features.cost_per_academic_year.min(), \
max=cluster_features.cost_per_academic_year.max(), \
step=10000, description ="Cost per year", \
style=style)
### economic mobility
mobility_slider = widgets.IntSlider(min=cluster_features.prop_bottom_income_to_top.min(), \
max=cluster_features.prop_bottom_income_to_top.max(), \
step=1, description ="Economic Mobility", \
style=style)
submit_button = widgets.Button(description ="Run Prediction")
In [81]:
import plotly.plotly as py
import plotly.graph_objs as go
import math
def plot_bubble(state, cluster):
#print("this works!")
state_df = dfs[state]
new_df =state_df[state_df['cluster'] == cluster]
hover_text = []
bubble_size = []
for index, row in new_df.iterrows():
hover_text.append(('School: {school_name}<br>'+
'Mobility: {prop_bottom_income_to_top}<br>'+
'Cost: {cost_per_academic_year}').format(school_name=row['school_name'],
prop_bottom_income_to_top=row['prop_bottom_income_to_top'],
cost_per_academic_year=row['cost_per_academic_year']))
bubble_size.append((row['prop_bottom_income_to_top']))
#new_df['text'] = hover_text
#new_df['size'] = bubble_size
sizeref = 2*max(bubble_size)/(60**2) #2*max(new_df['size'])/(60**2)
trace0 = go.Scatter(
x=new_df['cost_per_academic_year'],
y=new_df['median_earnings_in_10_yrs'],
mode='markers',
name=state,
text=hover_text, #new_df['text'],
marker=dict(
symbol='circle',
sizemode='area',
sizeref=sizeref/2,
size=bubble_size, #new_df['size'],
line=dict(
width=2
),
)
)
data = [trace0]
layout = go.Layout(
title='Cost vs Median earnings for your schools in {}. Bubbles are proportional to economic mobility'.format(state),
xaxis=dict(
title='Cost per year ',
gridcolor='rgb(255, 255, 255)',
#range=[2.003297660701705, 5.191505530708712],
#type='log',
#zerolinewidth=1,
#ticklen=5,
#gridwidth=2,
),
yaxis=dict(
title='Median earnings',
gridcolor='rgb(255, 255, 255)',
#range=[36.12621671352166, 91.72921793264332],
# zerolinewidth=1,
#ticklen=5,
#gridwidth=2,
),
paper_bgcolor='rgb(243, 243, 243)',
plot_bgcolor='rgb(243, 243, 243)',
)
fig = go.Figure(data=data, layout=layout)
return py.iplot(fig, filename='test')
In [82]:
state ='CA'
cluster =1
def plot_importance(state, model=student_data_model):
### -- takes state value and plots importance of features in random forest model -- ###
feature_list = list(dfs[state].columns)
feature_list.remove('school_name')
feature_list.remove('median_earnings_in_10_yrs')
feature_list.remove('prop_bottom_income_to_top')
feature_list.remove('avg_family_income')
feature_list.remove('cluster')
#feature_list.remove('mobility_rate')
importances = model[state].feature_importances_
feature_importance = pd.DataFrame(zip(feature_list, importances), columns = ('feature','importance'))
feature_importance.set_index('feature', inplace=True)
feature_importance.sort_values('importance').plot(kind='barh', figsize=(15, 25))
plt.axvline(0.1, color="gray")
plt.title('Feature importance for the random forest classifier', fontsize = 25)
plt.xlabel('Importance', fontsize = 16)
plt.ylabel("")
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
def run_ML(c):
feature_list = list(dfs[states_dropdown.value].columns)
feature_list.remove('school_name')
feature_list.remove('median_earnings_in_10_yrs')
feature_list.remove('prop_bottom_income_to_top')
feature_list.remove('avg_family_income')
feature_list.remove('cluster')
#feature_list.remove('mobility_rate')
array = [0] * len(feature_list)
p1 = "academics.program_percentage."+program_dropdown1.value
p2 = "academics.program_percentage."+program_dropdown2.value
p3 = "academics.program_percentage."+program_dropdown3.value
inst = "undergrad_length_"+ inst_dropdown.value
for i in range(0,len(feature_list)):
if feature_list[i]==p1 or feature_list[i]==p2 or feature_list[i]==p3 or feature_list[i]==inst:
array[i]=1
else:
array[i]=0
if feature_list[i]=='sat_scores_midpoint_critical_reading':
array[i]=sat_read_slider.value
if feature_list[i]=='sat_scores_midpoint_math':
array[i]=sat_math_slider.value
if feature_list[i]=='sat_scores_midpoint_writing':
array[i]=sat_writing_slider.value
if feature_list[i]=='cost_per_academic_year':
array[i]=cost_slider.value
if feature_list[i]=='prop_bottom_income_to_top':
array[i] = mobility_slider.value
X_predict = np.array(array)
X_predict=X_predict.reshape(1, -1)
y_predict=student_data_model[states_dropdown.value].predict(X_predict)
global state
state = states_dropdown.value
global cluster
cluster = int(y_predict)
print("Prediction accuracy is {}".format(model_accuracy[state]))
print("Complete! View graph below")
In [83]:
display(states_dropdown)
display(program_dropdown1)
display(program_dropdown2)
display(program_dropdown3)
display(sat_math_slider)
display(sat_writing_slider)
display(sat_read_slider)
display(cost_slider)
display(inst_dropdown)
display(submit_button)
#display(mobility_slider)
#, description ="SAT score math"
submit_button.on_click(run_ML)
In [84]:
print("Cluster number {} of {} total in the state of {}".format(cluster, cluster_number[state],state))
plot_bubble(state, cluster)
Out[84]:
In [85]:
plot_importance(state)
In [ ]: