About The SAT by Felix Gaye and Dainel Greenberg
Is there a relationship between household income per capita and SAT scores? May 2016
The SAT is a college entrance exam created by the College Board. It is used by a majority of colleges in the United States as a base metric by which to judge applicants. Because high schools vary in difficulty, it is often unfair to compare students based on GPA alone. An exam like the SAT serves as a way to even the playing field: everyone is given a chance to compete directly with their peers. The exam is administered seven times a year, and students may retake it as many times as they see fit.
Although the exam is meant to serve as an equalizer, we are interested in whether average household income plays a role in how high SAT scores are. We assume that it does, so the more important question is how large the effect is, and whether it is big enough to take into account when comparing students.
Packages imported
We use matplotlib.pyplot to draw scatter plots and pandas for data analysis and manipulation. We also import numpy for scientific computing and mathematical calculations, and statsmodels.formula.api to fit the OLS regressions.
In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
%matplotlib inline
Creating the Dataset
We used SAT scores from 2014. All of our data can be downloaded directly from the College Board website. We organized the data alphabetically by state, and our analysis focuses on the combined total SAT score across reading, writing, and math. The data was organized into one column per subject, saved as a CSV file, and read using pandas.
In [2]:
file1 = 'C:/Users/felgaye/Documents/Data Bootcamp Data sheet.csv'
df1 = pd.read_csv(file1)
In [3]:
df1
Out[3]:
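Since the analysis relies on the combined total score, it may be worth confirming that the Total column really is the sum of the three section columns before plotting. The following is only a minimal sketch under stated assumptions: it uses the standard library and a few made-up rows so that it runs without the real CSV file, and the State column name and the find_total_mismatches helper are hypothetical, not part of the original notebook.

```python
import csv
import io

# Made-up rows standing in for the real CSV (NOT College Board data).
# The "State" column name is an assumption; the other names match the notebook.
# The third row is deliberately inconsistent to show what the check reports.
SAMPLE_CSV = """State,Reading,Math,Writing,Total,Income
Alpha,500,510,490,1500,50000
Beta,520,530,505,1555,62000
Gamma,480,495,470,1450,45000
"""

def find_total_mismatches(rows):
    """Return the states whose Total differs from Reading + Math + Writing."""
    mismatches = []
    for row in rows:
        section_sum = int(row["Reading"]) + int(row["Math"]) + int(row["Writing"])
        if section_sum != int(row["Total"]):
            mismatches.append(row["State"])
    return mismatches

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mismatches = find_total_mismatches(rows)
print("Rows with inconsistent totals:", mismatches)
```

With the real data already loaded in df1, the same idea could be expressed by comparing df1.Total against the sum of the Reading, Math, and Writing columns.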
Graphs
We looked at household income and its relationship to SAT scores for 2014. For each graph we put SAT scores on the x-axis and income on the y-axis. The results are shown in the scatter plots below.
In [6]:
#SAT Reading vs Income
df1.plot.scatter('Reading', 'Income', color = 'r')
plt.title('SAT 2014 Reading scores vs. Income')
plt.xlabel('SAT Reading Score')
plt.ylabel('Income')
plt.show()
#SAT Writing vs Income
df1.plot.scatter('Writing', 'Income', color = 'b')
plt.title('SAT 2014 Writing scores vs Income')
plt.xlabel('SAT Writing Score')
plt.ylabel('Income')
plt.show()
#SAT Math vs Income
df1.plot.scatter('Math', 'Income', color = 'g')
plt.title('SAT 2014 Math scores vs Income')
plt.xlabel('SAT Math Score')
plt.ylabel('Income')
plt.show()
#SAT Total vs Income
df1.plot.scatter('Total', 'Income', color = 'g')
plt.title('SAT 2014 Total score vs Income')
plt.xlabel('SAT Total Score')
plt.ylabel('Income')
plt.show()
In [7]:
lm = smf.ols(formula='Reading ~ Income', data=df1).fit()
lm.params
lm.summary()
Out[7]:
In [8]:
lm = smf.ols(formula='Writing ~ Income', data=df1).fit()
lm.params
lm.summary()
Out[8]:
In [9]:
lm = smf.ols(formula='Math ~ Income', data=df1).fit()
lm.params
lm.summary()
Out[9]:
In [10]:
lm = smf.ols(formula='Total ~ Income', data=df1).fit()
lm.params
lm.summary()
Out[10]:
Correlation
The OLS regression results indicate that there is a 0.24 correlation between Total scores and Average Household Income.
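One point worth checking when reading the regression output: an OLS summary reports R-squared, and for a regression with a single predictor R-squared equals the square of the Pearson correlation coefficient r. The sketch below illustrates that relationship in plain Python so the arithmetic is visible. The numbers are made up for illustration and are not the College Board data, and the helper names pearson_r and r_squared are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def r_squared(xs, ys):
    """R-squared of the least-squares line fitted to ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up illustrative values (income in thousands), NOT the real data.
income = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
total = [1500.0, 1460.0, 1530.0, 1490.0, 1560.0, 1520.0]

r = pearson_r(income, total)
r2 = r_squared(income, total)
print(f"r = {r:.3f}, r squared = {r * r:.3f}, R-squared = {r2:.3f}")
```

On the real data, np.corrcoef(df1.Income, df1.Total)[0, 1] should give r directly, and the rsquared attribute of the fitted statsmodels result gives R-squared, so the two can be compared to see which of them the 0.24 figure refers to.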
In [19]:
# Average score for each section across all states (Reading, Math, Writing)
mean_scores = (np.mean(df1.Reading), np.mean(df1.Math), np.mean(df1.Writing))
In [20]:
print(mean_scores)
Conclusion
Our results suggest that states with a higher average household income have slightly higher SAT scores. The correlation is not negligible, but it is not strong enough to conclude that higher income causes higher SAT scores. There is a lot of variability in both SAT scores and income, which is most noticeable in our graphs: the state with the lowest income is, surprisingly, on the higher end of SAT scores. In that sense our data contradicts our initial hypothesis, and income alone is not indicative of test scores.
Many other factors need to be taken into consideration in our evaluation. Something as simple as test participation rate can have a huge impact on the data. Although the exam is taken nationwide, students in some states, specifically those of the Midwest, are more likely to take the ACT. In those states, the students who do take the SAT are likely to be more ambitious and to score higher. We assume that discrepancies like these could be one reason our hypothesis turned out to be incorrect.
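One way to explore the participation-rate concern would be a partial correlation, which measures the association between income and score after the linear effect of a third variable has been removed from both. The sketch below is only an illustration under stated assumptions: our dataset does not contain a participation column, so the participation values, like the other numbers here, are made up, and the helper names pearson_r and partial_r are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs.

    Uses the first-order partial correlation formula:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up illustrative values, NOT real state data.
income = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0]          # thousands of dollars
total = [1750.0, 1480.0, 1700.0, 1450.0, 1680.0, 1500.0]
participation = [5.0, 60.0, 10.0, 75.0, 20.0, 85.0]    # hypothetical percent

raw = pearson_r(income, total)
adjusted = partial_r(income, total, participation)
print(f"raw r = {raw:.3f}, r controlling for participation = {adjusted:.3f}")
```

If participation rates by state were added to the dataset, comparing the raw and adjusted values would show how much of the income relationship depends on which students sit the exam.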
Further Questions
What would happen if we decided to look at each state on an individual level?
Assuming that income plays a large role at the state level, what does that mean for how we should compare SAT scores? Would this negate the SAT's purpose of serving as a metric by which to compare students across the United States?