The sat-score data describes SAT scores (the SAT is a standardized college entrance exam) for the 50 states and the District of Columbia for the 2001 test-taking year, as provided by the College Board. The data also contains SAT participation rates, presumably expressed as percentages. The last row is a nationwide aggregate of the participation rate and of the SAT scores, with verbal and math reported separately.
The data contains a complete listing of SAT scores for all 50 states plus the District of Columbia. The last row holds the nationwide SAT scores and participation rate, so it is excluded from the 50-state-plus-DC view of the data. A second issue is that the SAT score is given only by component; the verbal and math scores are therefore summed to obtain a total SAT score.
SAT Scores in 2001
Methodology
Format a pandas DataFrame from a comma-delimited file containing 51 observations on the following 4 variables.
State: the 50 states of the U.S., plus the District of Columbia
Rate: test participation rate, in percent, by state
Verbal: result of the Verbal component of the SAT exam; section graded on a scale of 200–800
Math: result of the Math component of the SAT exam; section graded on a scale of 200–800
Total SAT: calculated from the source data; combines the Math and Verbal components of the exam issued in 2001
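The preparation steps described above could be sketched as follows. This is a minimal illustration, not the notebook's actual data: the three rows in `sample_csv` and their values are hypothetical stand-ins, and the `All` label for the aggregate row is an assumption about how the file marks it.

```python
import io

import pandas as pd

# Hypothetical stand-in for sat_scores.csv; the values are illustrative only.
sample_csv = io.StringIO(
    "State,Rate,Verbal,Math\n"
    "AA,80,500,510\n"
    "BB,10,580,570\n"
    "All,45,505,515\n"
)

df = pd.read_csv(sample_csv)

# Drop the nationwide aggregate row (assumed here to be labelled 'All').
states = df[df["State"] != "All"].copy()

# Total SAT is the sum of the two section scores.
states["Total_SAT"] = states["Verbal"] + states["Math"]

print(states)
```

Filtering on the label rather than on a row position keeps the sketch independent of how many rows the file holds.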
In [203]:
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [102]:
satscores = '/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/sat_scores.csv'
In [103]:
rows = []
with open(satscores, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)
In [104]:
print(rows)
In [105]:
#Header is list of labels from data
header = rows[0]
header
Out[105]:
In [108]:
#Data minus Header list
data = rows[1:]
data[0:10]
Out[108]:
In [109]:
#Exclusive List of States
list_states = []
for t in data:
    list_states.append(t[0])
In [110]:
list_states[0:10]
Out[110]:
In [ ]:
#List of Lists of Rate, SAT Scores
scores_rate = []
for t in data:
    scores_rate.append(t[1:])
In [111]:
scores_rate[0:10]
Out[111]:
In [116]:
type(scores_rate)
Out[116]:
In [117]:
scores_rate[0:10]
Out[117]:
In [119]:
#Convert each row of strings to a list of integers
numerical_list = []
for x in scores_rate:
    numerical_list.append(list(map(int, x)))
In [121]:
print(numerical_list[0:10])
In [130]:
type(numerical_list)
Out[130]:
In [126]:
header
Out[126]:
In [128]:
header_m_s = header[1:]
header_m_s
Out[128]:
In [124]:
numerical_list[0:10]
Out[124]:
In [132]:
sat_data = {}
In [133]:
#Build one list per column, keyed by the column name
for name in header_m_s:
    sat_data[name] = [x[header_m_s.index(name)] for x in numerical_list]
In [134]:
sat_data.values()
Out[134]:
In [135]:
type(sat_data)
Out[135]:
In [137]:
sat_data.keys()
Out[137]:
In [145]:
type(list_states)
Out[145]:
In [149]:
sat_data['Math'][0:10]
Out[149]:
In [150]:
#Convert every column from int to float
for i, j in sat_data.items():
    j = [float(x) for x in j]
    sat_data[i] = j
In [151]:
sat_data['Math'][0:10]
Out[151]:
In [152]:
sat_data.keys()
Out[152]:
In [153]:
temp = []
dictlist = []
In [154]:
#convert dictionary to list
for key, value in sat_data.items():
    temp = [key, value]
    dictlist.append(temp)
In [155]:
dictlist
Out[155]:
In [170]:
import pandas as pd
satscores = pd.read_csv('/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/sat_scores.csv')
In [171]:
satscores.head()
Out[171]:
In [172]:
#Build the working frame from the loaded CSV; Total_SAT starts empty and is filled in below
sat = pd.DataFrame(satscores, columns=['State','Rate','Verbal','Math','Total_SAT'])
In [220]:
#Include an aggregate version of SAT, computed before slicing so that sats carries it too
sat['Total_SAT'] = sat['Verbal'] + sat['Math']
In [221]:
#Exclude the 'ALL' category from data
sats = sat.iloc[:51].copy()
sat[0:10]
Out[221]:
In [222]:
print("Participation Rate Min: {}".format(sats["Rate"].min()))
In [223]:
print("Participation Rate Max: {}".format(sats["Rate"].max()))
In [224]:
print("SAT Math Min: {}".format(sats["Math"].min()))
In [225]:
print("SAT Math Max: {}".format(sats["Math"].max()))
In [226]:
print("SAT Verbal Min: {}".format(sats["Verbal"].min()))
In [227]:
print("SAT Verbal Max: {}".format(sats["Verbal"].max()))
In [228]:
print("Total SAT Min: {}".format(sats["Total_SAT"].min()))
In [229]:
print("Total SAT Max: {}".format(sats["Total_SAT"].max()))
In [230]:
def summary_stats(col, data):
    """Print the mean, median, mode, variance and standard deviation of a column."""
    print('COLUMN: ' + col)
    print('mean: ' + str(np.mean(data)))
    print('median: ' + str(np.median(data)))
    print('mode: ' + str(stats.mode([round(d) for d in data])))
    print('variance: ' + str(np.var(data)))
    print('standard deviation: ' + str(np.std(data)))
In [231]:
summary_stats('Rate', sats['Rate'])
In [232]:
summary_stats('Math', sats['Math'])
In [233]:
summary_stats('Verbal', sats['Verbal'])
In [234]:
summary_stats('Total_SAT', sats['Total_SAT'])
In [259]:
def stddev(data):
    """Return the population standard deviation of data (divides by n, not n - 1)."""
    m = np.mean(data)
    variance = sum([(i - m)**2 for i in data]) / len(data)
    return np.sqrt(variance)
In [267]:
stddev(sats['Rate'])
Out[267]:
In [268]:
stddev(sats['Math'])
Out[268]:
In [269]:
stddev(sats['Verbal'])
Out[269]:
In [270]:
stddev(sats['Total_SAT'])
Out[270]:
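The `stddev` helper above divides by n, which gives the population standard deviation, whereas the pandas `.std()` and `.var()` methods default to the sample form that divides by n - 1. A minimal sketch of the difference, using made-up numbers rather than the SAT data:

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the SAT file.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation: divide the squared deviations by n.
pop_std = np.std(values.to_numpy())  # NumPy default is ddof=0

# Sample standard deviation: divide by n - 1.
sample_std = values.std()  # pandas default is ddof=1

# The same population figure computed by hand, as in stddev().
manual_pop_std = np.sqrt(((values - values.mean()) ** 2).sum() / len(values))

print(pop_std, sample_std, manual_pop_std)
```

With 51 observations the two forms differ only slightly, but it is worth using the same one throughout an analysis.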
In [278]:
#Hypothesis testing where
# H0 (null hypothesis): There is no difference between Math and Verbal SAT Scores
# HA (alternative hypothesis): There is a difference between Math and Verbal SAT Scores
a_mean = sats['Math'].mean()
b_mean = sats['Verbal'].mean()
a_var = sats['Math'].var()
b_var = sats['Verbal'].var()
a_n = len(sats['Math'])
b_n = len(sats['Verbal'])
numerator = a_mean - b_mean
denominator = np.sqrt((a_var / a_n) + (b_var / b_n))
z = numerator / denominator
z
Out[278]:
In [279]:
#Two-sided p-value, because HA only states that the scores differ
p_val = 2 * (1 - stats.norm.cdf(abs(z)))
p_val
Out[279]:
In [282]:
alpha = .01
print("{} {} {}".format(p_val, alpha, p_val > alpha))
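Because the alternative hypothesis stated above is two-sided, the p-value should count both tails of the normal distribution. The sketch below wraps the same z statistic in a hypothetical helper, `two_sided_z`, and runs it on synthetic scores (seeded random numbers, not the SAT data). Since Math and Verbal are measured on the same states, a paired test such as `scipy.stats.ttest_rel` may be a more natural fit; it is shown only as an alternative, not as the notebook's method.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for the two score columns (51 paired observations).
rng = np.random.RandomState(0)
math_scores = rng.normal(530, 35, size=51)
verbal_scores = math_scores + rng.normal(0, 10, size=51)


def two_sided_z(a, b):
    """Return (z, p) for the difference in means, treating a and b as independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Standard error built from the sample variances (ddof=1).
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    # Two-sided p-value: probability mass in both tails beyond |z|.
    p = 2 * stats.norm.sf(abs(z))
    return z, p


z, p_val = two_sided_z(math_scores, verbal_scores)

# Alternative under the assumption that observations are paired by state.
t_stat, p_paired = stats.ttest_rel(math_scores, verbal_scores)

print(z, p_val)
print(t_stat, p_paired)
```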
In [235]:
#Plot once and title the axes that distplot returns
ax = sns.distplot(sats['Rate'], color='darkred', bins=10)
ax.set_title('Distribution of SAT Participation Rate')
plt.show()
In [236]:
#Plot once and title the axes that distplot returns
ax = sns.distplot(sats['Math'], color='yellow', bins=10)
ax.set_title('Distribution of Math SAT Scores')
plt.show()
In [237]:
#Plot once and title the axes that distplot returns
ax = sns.distplot(sats['Verbal'], color='darkblue', bins=10)
ax.set_title('Distribution of Verbal SAT Scores')
plt.show()
In [274]:
#Plot once and title the axes that distplot returns
ax = sns.distplot(sats['Total_SAT'], color='darkblue', bins=10)
ax.set_title('Distribution of Total SAT Scores')
plt.show()
A typical assumption is that the data are normally distributed, that is, bell-curve shaped.
These numeric fields are not normally distributed. The SAT Verbal component is negatively (left) skewed, whereas both Participation Rate and SAT Math are positively (right) skewed.
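The visual reading of skew can be backed by a number. The sketch below is one possible check, run on synthetic data rather than the SAT columns: `scipy.stats.skew` is positive for a right-skewed sample and negative for a left-skewed one, and `scipy.stats.shapiro` gives a p-value for the hypothesis that the sample is normal. The helper name `describe_shape` is hypothetical.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples: an exponential draw is right skewed, its mirror image left skewed.
rng = np.random.RandomState(1)
right_skewed = rng.exponential(scale=20, size=200)
left_skewed = -right_skewed


def describe_shape(values):
    """Return (skewness, Shapiro-Wilk p-value) for a one-dimensional sample."""
    values = np.asarray(values, dtype=float)
    skewness = stats.skew(values)
    _, p_normal = stats.shapiro(values)
    return skewness, p_normal


for name, sample in [("right", right_skewed), ("left", left_skewed)]:
    skewness, p_normal = describe_shape(sample)
    print(name, skewness, p_normal)
```

A small Shapiro-Wilk p-value is evidence against normality; with only 51 states the test has limited power, so it complements the histograms rather than replacing them.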
In [271]:
import seaborn as sns
sns.pairplot(sats)
plt.show()
Overall, the plots suggest a positive, roughly proportional relationship among SAT Math, SAT Verbal and Total scores: for example, as verbal scores increase, math scores increase as well. The relationships between the other variables look less clearly linear. Before building a linear regression model to describe the Math, Verbal or Total score, one would need to address the outliers that the scatter plots above show for each respective score.
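The relationships read off the pair plot can be quantified with a correlation matrix, and candidate outliers can be flagged with the usual 1.5 x IQR rule. The sketch below uses a small made-up frame, not the SAT data, and the helper `iqr_outliers` is hypothetical. Note that Total is the sum of Math and Verbal, so its correlation with each component is expected by construction.

```python
import pandas as pd

# Made-up scores; the last row is deliberately extreme.
scores = pd.DataFrame({
    "Verbal": [500, 510, 520, 530, 540, 700],
    "Math": [505, 512, 525, 528, 545, 690],
})
scores["Total_SAT"] = scores["Verbal"] + scores["Math"]

# Pearson correlation between every pair of numeric columns.
corr = scores.corr()


def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


flags = iqr_outliers(scores["Total_SAT"])

print(corr)
print(scores[flags])
```

This is the same rule a box plot uses to draw its fliers, so the flagged rows should match the points plotted beyond the whiskers.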
In [239]:
data = [sats['Math'], sats['Verbal']]
fig, ax1 = plt.subplots(figsize=(12, 8))
plt.boxplot(data)
ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)
ax1.set_axisbelow(True)
ax1.set_title('Box Plot of SAT Math / Verbal Scores', y =1.03, fontsize = 24)
ax1.set_xlabel('Features', fontsize = 18)
ax1.set_ylabel('SAT Scores', fontsize = 18)
# Set the axes ranges and axes labels
numBoxes = 2
ax1.set_xlim(0.5, numBoxes + 0.5)
ax1.set_ylim(400, 625)
xtickNames = plt.setp(ax1, xticklabels=['SAT Math Score', 'SAT Verbal Score'])
plt.setp(xtickNames, fontsize=14)
plt.axhline(625, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)
plt.show()
In [272]:
data = [sats['Total_SAT']]
fig, ax1 = plt.subplots(figsize=(12, 8))
plt.boxplot(data)
ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)
ax1.set_axisbelow(True)
ax1.set_title('Box Plot of Total SAT Scores', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('Combined SAT Scores', fontsize = 18)
# Set the axes ranges and axes labels
numBoxes = 1
ax1.set_xlim(0.5, numBoxes + 0.5)
ax1.set_ylim(900, 1300)
xtickNames = plt.setp(ax1, xticklabels=['Total SAT Scores'])
plt.setp(xtickNames, fontsize=14)
plt.axhline(1300, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)
plt.show()
In [273]:
data = [sats['Rate']]
fig, ax1 = plt.subplots(figsize=(12, 8))
plt.boxplot(data)
ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)
ax1.set_axisbelow(True)
ax1.set_title('Box Plot of Participation Rate in SAT Examination', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('Participation Rate', fontsize = 18)
# Set the axes ranges and axes labels
numBoxes = 1
ax1.set_xlim(0.5, numBoxes + 0.5)
ax1.set_ylim(0, 100)
xtickNames = plt.setp(ax1, xticklabels=['Participation Rate'])
plt.setp(xtickNames, fontsize=14)
plt.axhline(100, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)
plt.show()
In [246]:
sat.to_csv("/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/SAT_Scores_DC.csv", sep='\t')