Initial Eploration

Goals

  • Load dataset into sqlite server
  • Make basic queries against database
  • Understand basic structure and fields of dataset
  • Start exploring different aspects of the dataset and determine possible places of interest to explore in further detail

Imports


In [156]:
%matplotlib inline

import matplotlib.pyplot as plt
import sqlite3
import pandas as pd
import seaborn as sns

sns.set_style("white")

Load Dataset

I want to start an sqlite server, load the dataset into an active sqlite server, and be save that functionality into a method to be run in the future.


In [79]:
ls -l ../data/output | grep -v "M*.csv"


total 8269440
-rw-r--r--@ 1 graysonricketts  staff      861159 Sep 23  2015 CollegeScorecardDataDictionary-09-12-2015.pdf
-rw-r--r--@ 1 graysonricketts  staff    32585814 Sep 23  2015 Crosswalk_ZIP.zip
drwxr-xr-x@ 7 graysonricketts  staff         238 Sep 23  2015 Data_File_Cohort_Map/
-rw-r--r--@ 1 graysonricketts  staff      467161 Sep 23  2015 FullDataDocumentation.pdf
-rw-r--r--@ 1 graysonricketts  staff      390848 Sep 23  2015 data_dictionary.yaml
-rw-r--r--@ 1 graysonricketts  staff  1201238016 Sep 23  2015 database.sqlite
-rw-r--r--@ 1 graysonricketts  staff        1496 Sep 23  2015 hashes.txt

In [57]:
conn = sqlite3.connect('../data/output/database.sqlite')
c = conn.cursor()

def execute(sql):
    '''
    Executes a SQL command on the 'c' cursor and returns the results
    '''
    c.execute(sql)
    return c.fetchall()

Apparently, Sqlite is a serverless database (pretty cool, didn't know) so all I need to do is give it a file path, then I can start making queries. I created an execute method to save me couple of keystrokes and make it easier if I change the cursor.

Helper methods


In [157]:
def printByYear(data):
    '''
    Given a list of tuples with (year, data), prints the data next to corresponding year
    '''
    for datum in data:
        print "{0}: {1}".format(datum[0], datum[1])

Basic queries and exploration


In [58]:
# Print all tables in the database
tables = execute("SELECT Name FROM sqlite_master WHERE type='table'")
for table in tables:
    print table


(u'Scorecard',)

There is one table and it is called Scorecard.


In [82]:
# Print the number of rows in the database
rowCount = execute("SELECT Count(id) RowCount from Scorecard")
rowCount = rowCount[0][0]
print "Row count:", rowCount


Row count: 124699

In [158]:
# Number of intstiutions in dataset per year
rowCountByYear = execute("""SELECT Year, Count(id)
                            FROM Scorecard
                            GROUP BY Year""")
printByYear(rowCountByYear)


1996: 6794
1997: 6699
1998: 6480
1999: 6466
2000: 6478
2001: 6619
2002: 6546
2003: 6585
2004: 6660
2005: 6824
2006: 6848
2007: 6890
2008: 6975
2009: 7149
2010: 7414
2011: 7675
2012: 7793
2013: 7804

In [60]:
# Print the number of columns
fields = execute("PRAGMA table_info(Scorecard)")
print len(fields)


1731

There are too many columns to do an analysis of the data compostion from the above query. In order to understand what is included in the dataset I am looking through the FullDataDocumentation.pdf which covers the different categories of data and summarizes what is include in each category. Below is a table of things I found of interest.

All the fields in the database are in CollegeScorecardDataDictionary-09-12-2015.pdf along with the field name.

Category Field Type Details
School
Name Name of the school
Location Longitude, latitude, city, etc.
Main campus or branch 1 for main, 0 for branch
Type Public, private for-profit, private non-profit
Revenue Net tutition, instructional expenses, average faculty salary
Currently operating 1 for YES, 0 for NO
Academics
Programs offered Field of study ids
Admission
Admission rate for Undergraduates Rates for each branch or all branches
SAT and ACT scores
Costs
LOTS Look at page 6-7 of data documentation
Student Body
Number of degree seekers
Race statistics Self-reported on FAFSA
Undegrads by family income Percentages broken into categories based on expected family contribution on FAFSA
Retention rate
Student body age Percentage of students who are age 25-64
Parent education Percentages of students with parents that have some level of education. Collected from FAFSA.
Financial Aid
Cumulative median debt
Percent receiving Pell Grant Grant from government that does not need to be repaid
Percent receiving federal loans
Completion
Completion rate Percent of students graduating within 150% and 200% of the expected time to graduate
Title IV Students
Earnings
Mean and median income Available 6 and 10 years after graduation. Data comes from W2 tax forms
Threshold earnings Percent of students who are earning more than people age 25-34 with only a college degree. Measure of wether or not the degree was able to financially improve the students outcome had they not gone to college.
Repayment

In [167]:
# Get locations of the universities
coordinates = execute("""SELECT Latitude, Longitude 
                         FROM Scorecard
                         WHERE Latitude IS NOT NULL
                                AND Year=2013
                                AND main='Main campus'""")

In [168]:
print "Percent with location data (2013): {0: .2f}%".format(((len(coordinates)*1.0) / rowCountByYear[14][1]) * 100)

# Plot locations of institutions
df = pd.DataFrame()
def checkCordinates(x, y):
    if x >= -128.19 and x <= -65 and y >= 24.19 and y <= 49.62:
        return True
    return False
df['x'] = [row[1] for row in coordinates if checkCordinates(row[1], row[0])]
df['y'] = [row[0] for row in coordinates if checkCordinates(row[1], row[0])]

locations = sns.regplot('x', 'y',
                        data=df,
                        fit_reg=False)
locations.set(title="Locations of Institutions", xticks=[], yticks=[], xlabel="", ylabel="")
sns.despine(left=True, bottom=True)


Percent with location data (2013):  74.45%

In [161]:
# Number of institutions that are main campuses by year
mainCampuses = execute("""SELECT Year, Count(id) mainCount
                          FROM Scorecard
                          WHERE main='Main campus'
                          GROUP BY Year""")

In [166]:
print "Number of total main campuses: {0}".format(sum([count[1] for count in mainCampuses]))
printByYear(mainCampuses)

print "\nMain campus percentages"
for i, count in enumerate(mainCampuses):
    print "{0}: {1: .2f}%".format(count[0], ((count[1]*1.0)/rowCountByYear[i][1])*100)


Number of total main campuses: 100918
1996: 5869
1997: 5727
1998: 5513
1999: 5509
2000: 5504
2001: 5523
2002: 5498
2003: 5477
2004: 5492
2005: 5565
2006: 5565
2007: 5542
2008: 5556
2009: 5622
2010: 5729
2011: 5757
2012: 5761
2013: 5709

Main campus percentages
1996:  86.39%
1997:  85.49%
1998:  85.08%
1999:  85.20%
2000:  84.96%
2001:  83.44%
2002:  83.99%
2003:  83.17%
2004:  82.46%
2005:  81.55%
2006:  81.26%
2007:  80.44%
2008:  79.66%
2009:  78.64%
2010:  77.27%
2011:  75.01%
2012:  73.93%
2013:  73.15%

Ideas so far

I think a visualization with a picutre of all the universities in the US would be cool. Maybe as they add certain constaints the user oculd see the dots become highlighted and unlighted. Along with that there would be bar on the right with the most prominent universities in the categoires. Could also allow them to pin certain universities or institutions they were interested in.

As for direction to explore, I kind of what to explore everything, or at least let the user explore everything.


In [ ]: