Project 1

Step 1: Open the sat_scores.csv file. Investigate the data, and answer the questions below.

1. What does the data describe?

The sat-score data describes SAT, standardized college entrance exam, scores from the 50 states and the District of Columbia for test taking year 2001, as provided by the College Board. This data also contains participation in the SAT exam, presumably in percentages and across the U.S. Finally, the last row of data is an aggregator of all 50 states, plus DC, for participation rates and SAT score where verbal and math are separately stated.

2. Does the data look complete? Are there any obvious issues with the observations?

The data does contain a complete listing of SAT scores for all states plus the District of Columbia. The last row contains the nationwide SAT scores and participation, which is not to be included in the 50 state, plus DC view of the data. Additionally, another issue with the data is the unpacked version of the SAT scores given. Therefore, both verbal and math scores are summed in order to get total SAT score.

3. Create a data dictionary for the dataset.

SAT Scores in 2001 Description The sat-score data describes SAT, standardized college entrance exam, scores from the 50 states and the District of Columbia for test taking year 2001, as provided by the College Board. This data also contains participation in the SAT exam, presumably in percentages. Finally, the last row of data is an aggregator of all 50 states, plus DC, for participation rates and SAT score where verbal and math are separately stated.

Methodology Format a panda dataframe from a comma delimited file containing 51 observations on the following 4 variables.

State 50 states of the U.S, plus the District of Columbia

Rate Test participation rate; denoted in percentage by State

Verbal Result of Verbal component of the SAT exam; section graded on a scale of 200–800

Math Result of Math component of the SAT exam; section graded on a scale of 200–800

Total SAT Calculated from source data. Combines the Math and Verbal components of the exam issued in 2001.

Step 2: Load the data.

4. Load the data into a list of lists

In [203]:
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [102]:
satscores = '/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/sat_scores.csv'

In [103]:
rows = []
with open(satscores, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)
5. Print the data

In [104]:
print rows


[['State', 'Rate', 'Verbal', 'Math'], ['CT', '82', '509', '510'], ['NJ', '81', '499', '513'], ['MA', '79', '511', '515'], ['NY', '77', '495', '505'], ['NH', '72', '520', '516'], ['RI', '71', '501', '499'], ['PA', '71', '500', '499'], ['VT', '69', '511', '506'], ['ME', '69', '506', '500'], ['VA', '68', '510', '501'], ['DE', '67', '501', '499'], ['MD', '65', '508', '510'], ['NC', '65', '493', '499'], ['GA', '63', '491', '489'], ['IN', '60', '499', '501'], ['SC', '57', '486', '488'], ['DC', '56', '482', '474'], ['OR', '55', '526', '526'], ['FL', '54', '498', '499'], ['WA', '53', '527', '527'], ['TX', '53', '493', '499'], ['HI', '52', '485', '515'], ['AK', '51', '514', '510'], ['CA', '51', '498', '517'], ['AZ', '34', '523', '525'], ['NV', '33', '509', '515'], ['CO', '31', '539', '542'], ['OH', '26', '534', '439'], ['MT', '23', '539', '539'], ['WV', '18', '527', '512'], ['ID', '17', '543', '542'], ['TN', '13', '562', '553'], ['NM', '13', '551', '542'], ['IL', '12', '576', '589'], ['KY', '12', '550', '550'], ['WY', '11', '547', '545'], ['MI', '11', '561', '572'], ['MN', '9', '580', '589'], ['KS', '9', '577', '580'], ['AL', '9', '559', '554'], ['NE', '8', '562', '568'], ['OK', '8', '567', '561'], ['MO', '8', '577', '577'], ['LA', '7', '564', '562'], ['WI', '6', '584', '596'], ['AR', '6', '562', '550'], ['UT', '5', '575', '570'], ['IA', '5', '593', '603'], ['SD', '4', '577', '582'], ['ND', '4', '592', '599'], ['MS', '4', '566', '551'], ['All', '45', '506', '514']]
6. Extract a list of the labels from the data, and remove them from the data.

In [105]:
#Header is list of labels from data
header = rows[0]
header


Out[105]:
['State', 'Rate', 'Verbal', 'Math']

In [108]:
#Data minus Header list
data = rows[1:]
data[0:10]


Out[108]:
[['CT', '82', '509', '510'],
 ['NJ', '81', '499', '513'],
 ['MA', '79', '511', '515'],
 ['NY', '77', '495', '505'],
 ['NH', '72', '520', '516'],
 ['RI', '71', '501', '499'],
 ['PA', '71', '500', '499'],
 ['VT', '69', '511', '506'],
 ['ME', '69', '506', '500'],
 ['VA', '68', '510', '501']]
7. Create a list of State names extracted from the data. (Hint: use the list of labels to index on the State column)

In [109]:
#Exclusive List of States
list_states =[]

for t in data:
    list_states.append(t[0])

In [110]:
list_states[0:10]


Out[110]:
['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA']

In [ ]:
#List of Lists of Rate, SAT Scores
scores_rate = []

for t in data:
    scores_rate.append(t[1:])

In [111]:
scores_rate[0:10]


Out[111]:
[['82', '509', '510'],
 ['81', '499', '513'],
 ['79', '511', '515'],
 ['77', '495', '505'],
 ['72', '520', '516'],
 ['71', '501', '499'],
 ['71', '500', '499'],
 ['69', '511', '506'],
 ['69', '506', '500'],
 ['68', '510', '501']]
8. Print the types of each column

In [116]:
type(scores_rate)


Out[116]:
list
9. Do any types need to be reassigned? If so, go ahead and do it.

In [117]:
scores_rate[0:10]


Out[117]:
[['82', '509', '510'],
 ['81', '499', '513'],
 ['79', '511', '515'],
 ['77', '495', '505'],
 ['72', '520', '516'],
 ['71', '501', '499'],
 ['71', '500', '499'],
 ['69', '511', '506'],
 ['69', '506', '500'],
 ['68', '510', '501']]

In [119]:
numerical_list = []
index = []

for x in scores_rate:
    index = list(map(int, x))
    numerical_list.append(index)

In [121]:
print numerical_list[0:10]


[[82, 509, 510], [81, 499, 513], [79, 511, 515], [77, 495, 505], [72, 520, 516], [71, 501, 499], [71, 500, 499], [69, 511, 506], [69, 506, 500], [68, 510, 501]]

In [130]:
type(numerical_list)


Out[130]:
list
10. Create a dictionary for each column mapping the State to its respective value for that column.

In [126]:
header


Out[126]:
['State', 'Rate', 'Verbal', 'Math']

In [128]:
header_m_s = header[1:]
header_m_s


Out[128]:
['Rate', 'Verbal', 'Math']

In [124]:
numerical_list[0:10]


Out[124]:
[[82, 509, 510],
 [81, 499, 513],
 [79, 511, 515],
 [77, 495, 505],
 [72, 520, 516],
 [71, 501, 499],
 [71, 500, 499],
 [69, 511, 506],
 [69, 506, 500],
 [68, 510, 501]]

In [132]:
sat_data = {}

In [133]:
for name in header_m_s:
    sat_data[name] = [x[header_m_s.index(name)] for x in numerical_list]

In [134]:
sat_data.values()


Out[134]:
[[82,
  81,
  79,
  77,
  72,
  71,
  71,
  69,
  69,
  68,
  67,
  65,
  65,
  63,
  60,
  57,
  56,
  55,
  54,
  53,
  53,
  52,
  51,
  51,
  34,
  33,
  31,
  26,
  23,
  18,
  17,
  13,
  13,
  12,
  12,
  11,
  11,
  9,
  9,
  9,
  8,
  8,
  8,
  7,
  6,
  6,
  5,
  5,
  4,
  4,
  4,
  45],
 [510,
  513,
  515,
  505,
  516,
  499,
  499,
  506,
  500,
  501,
  499,
  510,
  499,
  489,
  501,
  488,
  474,
  526,
  499,
  527,
  499,
  515,
  510,
  517,
  525,
  515,
  542,
  439,
  539,
  512,
  542,
  553,
  542,
  589,
  550,
  545,
  572,
  589,
  580,
  554,
  568,
  561,
  577,
  562,
  596,
  550,
  570,
  603,
  582,
  599,
  551,
  514],
 [509,
  499,
  511,
  495,
  520,
  501,
  500,
  511,
  506,
  510,
  501,
  508,
  493,
  491,
  499,
  486,
  482,
  526,
  498,
  527,
  493,
  485,
  514,
  498,
  523,
  509,
  539,
  534,
  539,
  527,
  543,
  562,
  551,
  576,
  550,
  547,
  561,
  580,
  577,
  559,
  562,
  567,
  577,
  564,
  584,
  562,
  575,
  593,
  577,
  592,
  566,
  506]]

In [135]:
type(sat_data)


Out[135]:
dict

In [137]:
sat_data.keys()


Out[137]:
['Rate', 'Math', 'Verbal']

In [145]:
type(list_states)


Out[145]:
list
11. Create a dictionary with the values for each of the numeric columns

In [149]:
sat_data['Math'][0:10]


Out[149]:
[510, 513, 515, 505, 516, 499, 499, 506, 500, 501]

In [150]:
for i, j in sat_data.items():
    j = [float(x) for x in j]
    sat_data[i] = j

In [151]:
sat_data['Math'][0:10]


Out[151]:
[510.0, 513.0, 515.0, 505.0, 516.0, 499.0, 499.0, 506.0, 500.0, 501.0]

In [152]:
sat_data.keys()


Out[152]:
['Rate', 'Math', 'Verbal']

In [153]:
temp = []
dictlist = []

In [154]:
#convert dictionary to list
for key, value in sat_data.iteritems():
    temp = [key,value]
    dictlist.append(temp)

In [155]:
dictlist


Out[155]:
[['Rate',
  [82.0,
   81.0,
   79.0,
   77.0,
   72.0,
   71.0,
   71.0,
   69.0,
   69.0,
   68.0,
   67.0,
   65.0,
   65.0,
   63.0,
   60.0,
   57.0,
   56.0,
   55.0,
   54.0,
   53.0,
   53.0,
   52.0,
   51.0,
   51.0,
   34.0,
   33.0,
   31.0,
   26.0,
   23.0,
   18.0,
   17.0,
   13.0,
   13.0,
   12.0,
   12.0,
   11.0,
   11.0,
   9.0,
   9.0,
   9.0,
   8.0,
   8.0,
   8.0,
   7.0,
   6.0,
   6.0,
   5.0,
   5.0,
   4.0,
   4.0,
   4.0,
   45.0]],
 ['Math',
  [510.0,
   513.0,
   515.0,
   505.0,
   516.0,
   499.0,
   499.0,
   506.0,
   500.0,
   501.0,
   499.0,
   510.0,
   499.0,
   489.0,
   501.0,
   488.0,
   474.0,
   526.0,
   499.0,
   527.0,
   499.0,
   515.0,
   510.0,
   517.0,
   525.0,
   515.0,
   542.0,
   439.0,
   539.0,
   512.0,
   542.0,
   553.0,
   542.0,
   589.0,
   550.0,
   545.0,
   572.0,
   589.0,
   580.0,
   554.0,
   568.0,
   561.0,
   577.0,
   562.0,
   596.0,
   550.0,
   570.0,
   603.0,
   582.0,
   599.0,
   551.0,
   514.0]],
 ['Verbal',
  [509.0,
   499.0,
   511.0,
   495.0,
   520.0,
   501.0,
   500.0,
   511.0,
   506.0,
   510.0,
   501.0,
   508.0,
   493.0,
   491.0,
   499.0,
   486.0,
   482.0,
   526.0,
   498.0,
   527.0,
   493.0,
   485.0,
   514.0,
   498.0,
   523.0,
   509.0,
   539.0,
   534.0,
   539.0,
   527.0,
   543.0,
   562.0,
   551.0,
   576.0,
   550.0,
   547.0,
   561.0,
   580.0,
   577.0,
   559.0,
   562.0,
   567.0,
   577.0,
   564.0,
   584.0,
   562.0,
   575.0,
   593.0,
   577.0,
   592.0,
   566.0,
   506.0]]]

Step 3: Describe the data

12. Print the min and max of each column

In [170]:
import pandas as pd
satscores = pd.read_csv('/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/sat_scores.csv')

In [171]:
satscores.head()


Out[171]:
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516

In [172]:
sat = pd.DataFrame(sat, columns=['State','Rate','Verbal','Math','Total_SAT'])

In [220]:
#Exclude the 'ALL' category from data
sats = sat.iloc[:51]

In [221]:
sat['Total_SAT'] = sat['Verbal'] + sat['Math']       #Included an aggregate version of SAT
sat[0:10]


Out[221]:
State Rate Verbal Math Total_SAT
0 CT 82 509 510 1019
1 NJ 81 499 513 1012
2 MA 79 511 515 1026
3 NY 77 495 505 1000
4 NH 72 520 516 1036
5 RI 71 501 499 1000
6 PA 71 500 499 999
7 VT 69 511 506 1017
8 ME 69 506 500 1006
9 VA 68 510 501 1011

In [222]:
print "Participation Rate Min:",sats["Rate"].min()


Participation Rate Min: 4

In [223]:
print "Participation Rate Max:",sats["Rate"].max()


Participation Rate Max: 82

In [224]:
print "SAT Math Min:",sats["Math"].min()


SAT Math Min: 439

In [225]:
print "SAT Math Max:",sats["Math"].max()


SAT Math Max: 603

In [226]:
print "SAT Verbal Min:",sat["Verbal"].min()


SAT Verbal Min: 482

In [227]:
print "SAT Verbal Max:",sats["Verbal"].max()


SAT Verbal Max: 593

In [228]:
print "Total SAT Min:",sat["Total_SAT"].min()


Total SAT Min: 956

In [229]:
print "Total SAT Max:",sats["Total_SAT"].max()


Total SAT Max: 1196

In [230]:
def summary_stats(col, data):
    print 'COLUMN: ' + col
    print 'mean: ' + str(np.mean(data))
    print 'median: ' + str(np.median(data))
    print 'mode: ' + str(stats.mode([round(d) for d in data]))
    print 'variance: ' + str(np.var(data))
    print 'standard deviation: ' + str(np.std(data))

In [231]:
summary_stats('Rate', sats['Rate'])


COLUMN: Rate
mean: 37.0
median: 33.0
mode: ModeResult(mode=array([ 4.]), count=array([3]))
variance: 744.156862745
standard deviation: 27.2792386761

In [232]:
summary_stats('Math', sats['Math'])


COLUMN: Math
mean: 531.843137255
median: 525.0
mode: ModeResult(mode=array([ 499.]), count=array([6]))
variance: 1290.95578624
standard deviation: 35.9298731731

In [233]:
summary_stats('Verbal', sats['Verbal'])


COLUMN: Verbal
mean: 532.529411765
median: 527.0
mode: ModeResult(mode=array([ 562.]), count=array([3]))
variance: 1091.11188005
standard deviation: 33.0319826842

In [234]:
summary_stats('Total_SAT', sats['Total_SAT'])


COLUMN: Total_SAT
mean: 1064.37254902
median: 1048.0
mode: ModeResult(mode=array([ 1000.]), count=array([5]))
variance: 4518.15532488
standard deviation: 67.217224912
13. Write a function using only list comprehensions, no loops, to compute Standard Deviation. Print the Standard Deviation of each numeric column.

In [259]:
def stddev(data):
    """returns the standard deviation of lst"""
    m = np.mean(data)
    variance = sum([(i - m)**2 for i in data]) / len(data)
    return np.sqrt(variance)

In [267]:
stddev(sats['Rate'])


Out[267]:
27.279238676053591

In [268]:
stddev(sats['Math'])


Out[268]:
35.929873173114082

In [269]:
stddev(sats['Verbal'])


Out[269]:
33.031982684152283

In [270]:
stddev(sats['Total_SAT'])


Out[270]:
67.217224912034638

In [278]:
#Hypothesis testing where
# H0 (null hypothesis): There is no difference between Math and Verbal SAT Scores
# HA (alternative hypothesis): There is a difference between Math and Verbal SAT Scores

a_mean = sats['Math'].mean()
b_mean = sats['Verbal'].mean()

a_var = sats['Math'].var()
b_var = sats['Verbal'].var()

a_n = len(sats['Math'])
b_n = len(sats['Verbal'])

numerator = a_mean - b_mean
denominator = np.sqrt((a_var / a_n) + (b_var / b_n))
z = numerator / denominator
z


Out[278]:
-0.099427340585701379

In [279]:
p_val = 1 - stats.norm.cdf(z)
p_val


Out[279]:
0.53960051216717997

In [282]:
alpha = .01
print p_val, alpha, p_val > alpha


0.539600512167 0.01 True

Step 4: Visualize the data

14. Using MatPlotLib and PyPlot, plot the distribution of the Rate using histograms.

In [235]:
ax = sns.distplot(sats['Rate'], bins=10)
sns.distplot(sats['Rate'], color='darkred', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution SAT Participation Rate')
plt.show()


15. Plot the Math distribution

In [236]:
ax = sns.distplot(sats['Math'], bins=10)
sns.distplot(sats['Math'], color='yellow', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution of Math SAT Scores')
plt.show()


16. Plot the Verbal distribution

In [237]:
ax = sns.distplot(sats['Verbal'], bins=10)
sns.distplot(sats['Verbal'], color='darkblue', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution of Verbal SAT Scores')
plt.show()



In [274]:
ax = sns.distplot(sats['Total_SAT'], bins=10)
sns.distplot(sats['Total_SAT'], color='darkblue', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution of Total SAT Scores')
plt.show()


17. What is the typical assumption for data distribution?

A typical assumption of a data distribution is that the distribution is normal or the data is bell-curve shaped.

18. Does that distribution hold true for our data?

No, these numeric fields do not have a normal distribution. The SAT Verbal component is negatively skewed, whereas both Participation Rate and SAT Math are right skewed or positively skewed distribution.

19. Plot some scatterplots. BONUS: Use a PyPlot figure to present multiple plots at once.

In [271]:
import seaborn as sns
sns.pairplot(sats)
plt.show()


20. Are there any interesting relationships to note?

There seems to be a suggestive proportional relationship between SAT Math, SAT Verbal and Total scores, overall. That is, for example, as verbal scores increase, the math scores proportionally and positively increase. Other variable relationships, however, seem to have a rather inconclusive linear relationship. When considering building a linear regression model to describe Math, Verbal or Total Score one would need to address the outliers the scatter plots above display for each resective scores.

21. Create box plots for each variable.

In [239]:
data = [sats['Math'], sats['Verbal']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data)

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Box Plot of SAT Math / Verbal Scores', y =1.03, fontsize = 24)
ax1.set_xlabel('Features', fontsize = 18)
ax1.set_ylabel('SAT Scores', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 2
ax1.set_xlim(0.5, numBoxes + 0.5) 
ax1.set_ylim(400, 625)
xtickNames = plt.setp(ax1, xticklabels=['SAT Math Score', 'SAT Verbal Score'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(625, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()



In [272]:
data = [sats['Total_SAT']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data)

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Box Plot of Total SAT Scores', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('Combined SAT Scores', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 1
ax1.set_xlim(0.5, numBoxes + 0.5) 
ax1.set_ylim(900, 1300)
xtickNames = plt.setp(ax1, xticklabels=['Total SAT Scores'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(1300, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()



In [273]:
data = [sats['Rate']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data)

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Box Plot of Participation Rate in SAT Examination', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('Participation Rate', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 1
ax1.set_xlim(0.5, numBoxes + 0.5) 
ax1.set_ylim(0, 100)
xtickNames = plt.setp(ax1, xticklabels=['Participation Rate'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(100, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()


BONUS: Using Tableau, create a heat map for each variable using a map of the US.

In [246]:
sat.to_csv("/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/SAT_Scores_DC.csv", sep='\t')