In this post, we look at the SAT (Scholastic Aptitude Test) scores of high schoolers in the US, along with other information. We will use Python for the data analysis.
The SAT is a test that high schoolers in the US take before applying to college, so it is fairly important. Moreover, SAT scores are used to indicate how good a school district is. There have been allegations that the SAT is unfair (to certain ethnic groups in the US, for example). Doing this analysis on New York City data will help shed some light on the fairness of the SAT.
In order to investigate the relation between the SAT and other factors such as ethnicity or class sizes, we need data. On the website of the City of New York, many data sets are available, from which the following have been selected: the SAT results by school, the class sizes at the school level, and the school demographics.
In what follows, we take a look at these data sets. Then, we clean and unify the individual data sets into a single one in order to work with the data more easily.
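As a preview of the loading cells below, the three CSV files could equivalently be read in a single loop. The following is a minimal sketch; the file names and dictionary keys are the ones used in the cells that follow:

#read the three selected CSV files into a dictionary of data frames
import pandas as pd

files = {"sat_results": "SAT_Results",
         "demographics": "Demographics_and_Accountability",
         "class_size": "Class_Size-School_level_detail"}
data = {}
for key, name in files.items():
    data[key] = pd.read_csv("data/{0}.csv".format(name))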
In [33]:
#in order to analyse the data sets, we load them into pandas data frames (using the pandas and numpy libraries)
import pandas as pd
import numpy as np
In [34]:
#initialize the dictionary that will contain the different data sets
data = {}
This data set contains the most recent SAT results for New York City, available at the school level.
Note that the SAT is divided into 3 sections, each of which is scored out of 800 points (the last 3 columns). The total score is therefore out of 2400.
Full data can be downloaded here in CSV format.
In [35]:
#read csv file and show the data head
data["sat_results"] = pd.read_csv("data/{0}.csv".format("SAT_Results"))
data["sat_results"].head()
Out[35]:
This data set contains the annual school accounts of NYC public school student populations, with several pieces of information including ethnicity and gender.
Full data can be downloaded here in CSV format.
In [36]:
#read csv file and show the data head
data["demographics"] = pd.read_csv("data/{0}.csv".format("Demographics_and_Accountability"))
data["demographics"].head()
Out[36]:
In [37]:
#read csv file and show the data head
data["class_size"] = pd.read_csv("data/{0}.csv".format("Class_Size-School_level_detail"))
data["class_size"].head()
Out[37]:
Having a consistent dataset will help us do analysis more quickly. In order to do this, we’ll first need to find a common column to unify them on. Looking at the output above, it appears that "DBN" (District Borough Number) might be that common column, as it appears in multiple datasets. DBN is a unique code for each school.
However, not every data set contains a "DBN" column in this form: the "class_size" data set, for instance, has no "DBN" column at all. Thus, we need to clean these data sets first.
If we look at the "class_size" data set and at the "DBN" column in the other data sets, it looks like the DBN is actually a combination of "CSD" and "SCHOOL CODE". There is no systematic way to find insights like this in data; it requires some exploration and playing around to figure out. Let us now create the "DBN" column.
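As a minimal illustration of the format (with hypothetical values), the CSD is zero-padded to two digits and prepended to the school code:

#hypothetical example of the DBN format
csd = 1
school_code = "M292"
print("{0:02d}{1}".format(csd, school_code))   #prints "01M292"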
In [38]:
#create the "DBN" column by combining other columns
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
data["class_size"].head()
Out[38]:
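As a quick sanity check (a sketch), we could count how many of the newly built DBNs also appear in the SAT results data set:

#sanity check: overlap between the constructed DBNs and those in the SAT results
common_dbn = set(data["class_size"]["DBN"]) & set(data["sat_results"]["DBN"])
len(common_dbn)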
In order to combine these datasets, we’ll need to find a way to condense data sets like "class_size" to the point where there’s only a single row per high school. If not, there won’t be a way to compare SAT scores to class size. We can accomplish this by first understanding the data better, then by doing some aggregation. With the class_size dataset, it looks like GRADE and PROGRAM TYPE have multiple values for each school. By restricting each field to a single value, we can filter most of the duplicate rows.
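Before filtering, we could inspect which values these two fields take, for instance with a quick sketch like the following (note the trailing space in the "GRADE " column name):

#inspect the values taken by the GRADE and PROGRAM TYPE fields
print(data["class_size"]["GRADE "].value_counts())
print(data["class_size"]["PROGRAM TYPE"].value_counts())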
In [39]:
class_size_temp = data["class_size"]
#restrict the class_size data set to a specific grade and program type
class_size_temp = class_size_temp[class_size_temp["GRADE "] == "09-12"]
class_size_temp = class_size_temp[class_size_temp["PROGRAM TYPE"] == "GEN ED"]
#there are still duplicate DBNs, so we group the class_size data set by DBN and take the average of each numeric column
class_size_temp = class_size_temp.groupby("DBN").agg(np.mean)
#reset the index, so DBN is added back in as a column (otherwise, DBN is the index column)
class_size_temp.reset_index(inplace=True)
data["class_size_ge_09-12"] = class_size_temp
data["class_size_ge_09-12"].head()
Out[39]:
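As a quick check (a sketch), each DBN should now appear only once:

#check that there is exactly one row per school after the aggregation
data["class_size_ge_09-12"]["DBN"].is_unique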
Next, we’ll need to condense the demographics dataset. The data was collected over multiple years for the same schools, so we will only keep the rows where the schoolyear field is the most recent one available.
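To confirm which school years are present (and that 20112012 is indeed the most recent one), we could first inspect the schoolyear column, for example:

#list the school years present in the demographics data set
data["demographics"]["schoolyear"].value_counts().sort_index()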
In [40]:
#select only one year
demographics_temp = data["demographics"]
demographics_temp = demographics_temp[demographics_temp["schoolyear"] == 20112012]
data["demographics_2011-2012"] = demographics_temp
data["demographics_2011-2012"].head()
Out[40]:
Now, we compute the total SAT score from its 3 individual components.
In [41]:
#convert each of the SAT score columns from a string to a number
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors='coerce')
#add together all of the columns to get the sat_score column, which is the total SAT score
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
data['sat_results'].head()
Out[41]:
In [42]:
#joining data sets
data_full = data['sat_results']
data_full = data_full.merge(data["class_size_ge_09-12"], on="DBN", how="inner")
data_full = data_full.merge(data["demographics_2011-2012"], on="DBN", how="inner")
#the structure of the global data set
data_full.shape
Out[42]:
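Since we used inner joins, schools missing from any of the data sets are dropped. A quick sketch to see how many schools survive the joins:

#compare the number of schools before and after the inner joins
print(data["sat_results"].shape[0], data_full.shape[0])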
A good way to explore a dataset and see what columns are related to the one you care about is to compute correlations. This will tell you which columns are closely related to the column you’re interested in:
In [43]:
#show the correlations of the columns with the "sat_score" column
data_full_corr = data_full.corr()['sat_score']
data_full_corr.sort_values(ascending=False)
Out[43]:
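Since the full list of correlations is long, one simple way (a sketch) to focus on the strongest relations is to filter by absolute value; the 0.25 threshold below is arbitrary:

#keep only the columns whose correlation with sat_score is reasonably strong
strong_corr = data_full_corr[data_full_corr.abs() > 0.25]
strong_corr.sort_values(ascending=False)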
This gives us a few insights that can be explored, for example the correlations of sat_score with the ethnicity percentages (white_per, asian_per, black_per, hispanic_per) and with the gender percentages (male_per, female_per). Each of these items is a potential angle to explore and tell a story about using the data.
In [44]:
#importing the matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline
#bar chart of the correlations between sat_score and the ethnicity percentage columns
data_full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
Out[44]:
It looks like higher percentages of white and asian students correlate with higher SAT scores, whereas higher percentages of black and hispanic students correlate with lower SAT scores. For hispanic students, this may be due to the fact that there are more recent immigrants among them who are English learners.
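To dig deeper into one of these relations, we could for instance make a scatter plot of "hispanic_per" against "sat_score" (a sketch analogous to the plot used below for "female_per"):

#scatter plot of hispanic_per against sat_score
data_full.plot.scatter(x='hispanic_per', y='sat_score')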
In [45]:
#bar chart of the correlations between sat_score and the gender percentage columns
data_full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()
Out[45]:
To dig deeper into this correlation, we can make a scatter plot of "female_per" against "sat_score".
In [46]:
#scatter plot
data_full.plot.scatter(x='female_per', y='sat_score')
Out[46]:
It looks like there’s a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster.
In [47]:
data_full[(data_full["female_per"] > 65) & (data_full["sat_score"] > 1400)]["SCHOOL NAME"]
Out[47]:
Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females and higher SAT scores. This likely accounts for the correlation between higher female percentages and higher SAT scores, and for the inverse correlation between higher male percentages and SAT scores.
In this post, we have analysed the schools' SAT scores and their relation to other information such as ethnicity and gender. However, we have only scratched the surface of some angles and could have dug deeper. Furthermore, we could have combined our data with other data sets, such as school location information.
This case study is based on an article by Vik Paruchuri. I thank him for this example.