Project 1

Step 1: Open the `sat_scores.csv` file. Investigate the data, and answer the questions below.

1. What does the data describe?

The sat-score data describes SAT, standardized college entrance exam, scores from the 50 states and the District of Columbia for test taking year 2001, as provided by the College Board. This data also contains participation in the SAT exam, presumably in percentages and across the U.S. Finally, the last row of data is an aggregator of all 50 states, plus DC, for participation rates and SAT score where verbal and math are separately stated.

2. Does the data look complete? Are there any obvious issues with the observations?

The data does contain a complete listing of SAT scores for all states plus the District of Columbia. The last row contains the nationwide SAT scores and participation, which is not to be included in the 50 state, plus DC view of the data. Additionally, another issue with the data is the unpacked version of the SAT scores given. Therefore, both verbal and math scores are summed in order to get total SAT score.

3. Create a data dictionary for the dataset.

SAT Scores in 2001 Description The sat-score data describes SAT, standardized college entrance exam, scores from the 50 states and the District of Columbia for test taking year 2001, as provided by the College Board. This data also contains participation in the SAT exam, presumably in percentages. Finally, the last row of data is an aggregator of all 50 states, plus DC, for participation rates and SAT score where verbal and math are separately stated.

Methodology Format a panda dataframe from a comma delimited file containing 51 observations on the following 4 variables.

State 50 states of the U.S, plus the District of Columbia

Rate Test participation rate; denoted in percentage by State

Verbal Result of Verbal component of the SAT exam; section graded on a scale of 200–800

Math Result of Math component of the SAT exam; section graded on a scale of 200–800

Total SAT Calculated from source data. Combines the Math and Verbal components of the exam issued in 2001.

Step 2: Load the data.

4. Load the data into a list of lists



In [203]:

    
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline



In [102]:

    
satscores = '/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/sat_scores.csv'



In [103]:

    
rows = []
with open(satscores, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)

5. Print the data



In [104]:

    
print rows









    



[['State', 'Rate', 'Verbal', 'Math'], ['CT', '82', '509', '510'], ['NJ', '81', '499', '513'], ['MA', '79', '511', '515'], ['NY', '77', '495', '505'], ['NH', '72', '520', '516'], ['RI', '71', '501', '499'], ['PA', '71', '500', '499'], ['VT', '69', '511', '506'], ['ME', '69', '506', '500'], ['VA', '68', '510', '501'], ['DE', '67', '501', '499'], ['MD', '65', '508', '510'], ['NC', '65', '493', '499'], ['GA', '63', '491', '489'], ['IN', '60', '499', '501'], ['SC', '57', '486', '488'], ['DC', '56', '482', '474'], ['OR', '55', '526', '526'], ['FL', '54', '498', '499'], ['WA', '53', '527', '527'], ['TX', '53', '493', '499'], ['HI', '52', '485', '515'], ['AK', '51', '514', '510'], ['CA', '51', '498', '517'], ['AZ', '34', '523', '525'], ['NV', '33', '509', '515'], ['CO', '31', '539', '542'], ['OH', '26', '534', '439'], ['MT', '23', '539', '539'], ['WV', '18', '527', '512'], ['ID', '17', '543', '542'], ['TN', '13', '562', '553'], ['NM', '13', '551', '542'], ['IL', '12', '576', '589'], ['KY', '12', '550', '550'], ['WY', '11', '547', '545'], ['MI', '11', '561', '572'], ['MN', '9', '580', '589'], ['KS', '9', '577', '580'], ['AL', '9', '559', '554'], ['NE', '8', '562', '568'], ['OK', '8', '567', '561'], ['MO', '8', '577', '577'], ['LA', '7', '564', '562'], ['WI', '6', '584', '596'], ['AR', '6', '562', '550'], ['UT', '5', '575', '570'], ['IA', '5', '593', '603'], ['SD', '4', '577', '582'], ['ND', '4', '592', '599'], ['MS', '4', '566', '551'], ['All', '45', '506', '514']]

6. Extract a list of the labels from the data, and remove them from the data.



In [105]:

    
#Header is list of labels from data
header = rows[0]
header









    Out[105]:





['State', 'Rate', 'Verbal', 'Math']



In [108]:

    
#Data minus Header list
data = rows[1:]
data[0:10]









    Out[108]:





[['CT', '82', '509', '510'],
 ['NJ', '81', '499', '513'],
 ['MA', '79', '511', '515'],
 ['NY', '77', '495', '505'],
 ['NH', '72', '520', '516'],
 ['RI', '71', '501', '499'],
 ['PA', '71', '500', '499'],
 ['VT', '69', '511', '506'],
 ['ME', '69', '506', '500'],
 ['VA', '68', '510', '501']]

7. Create a list of State names extracted from the data. (Hint: use the list of labels to index on the State column)



In [109]:

    
#Exclusive List of States
list_states =[]

for t in data:
    list_states.append(t[0])



In [110]:

    
list_states[0:10]









    Out[110]:





['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA']



In [ ]:

    
#List of Lists of Rate, SAT Scores
scores_rate = []

for t in data:
    scores_rate.append(t[1:])



In [111]:

    
scores_rate[0:10]









    Out[111]:





[['82', '509', '510'],
 ['81', '499', '513'],
 ['79', '511', '515'],
 ['77', '495', '505'],
 ['72', '520', '516'],
 ['71', '501', '499'],
 ['71', '500', '499'],
 ['69', '511', '506'],
 ['69', '506', '500'],
 ['68', '510', '501']]

8. Print the types of each column



In [116]:

    
type(scores_rate)









    Out[116]:





list

9. Do any types need to be reassigned? If so, go ahead and do it.



In [117]:

    
scores_rate[0:10]









    Out[117]:





[['82', '509', '510'],
 ['81', '499', '513'],
 ['79', '511', '515'],
 ['77', '495', '505'],
 ['72', '520', '516'],
 ['71', '501', '499'],
 ['71', '500', '499'],
 ['69', '511', '506'],
 ['69', '506', '500'],
 ['68', '510', '501']]



In [119]:

    
numerical_list = []
index = []

for x in scores_rate:
    index = list(map(int, x))
    numerical_list.append(index)



In [121]:

    
print numerical_list[0:10]









    



[[82, 509, 510], [81, 499, 513], [79, 511, 515], [77, 495, 505], [72, 520, 516], [71, 501, 499], [71, 500, 499], [69, 511, 506], [69, 506, 500], [68, 510, 501]]



In [130]:

    
type(numerical_list)









    Out[130]:





list

10. Create a dictionary for each column mapping the State to its respective value for that column.



In [126]:

    
header









    Out[126]:





['State', 'Rate', 'Verbal', 'Math']



In [128]:

    
header_m_s = header[1:]
header_m_s









    Out[128]:





['Rate', 'Verbal', 'Math']



In [124]:

    
numerical_list[0:10]









    Out[124]:





[[82, 509, 510],
 [81, 499, 513],
 [79, 511, 515],
 [77, 495, 505],
 [72, 520, 516],
 [71, 501, 499],
 [71, 500, 499],
 [69, 511, 506],
 [69, 506, 500],
 [68, 510, 501]]



In [132]:

    
sat_data = {}



In [133]:

    
for name in header_m_s:
    sat_data[name] = [x[header_m_s.index(name)] for x in numerical_list]



In [134]:

    
sat_data.values()



In [135]:

    
type(sat_data)









    Out[135]:





dict



In [137]:

    
sat_data.keys()









    Out[137]:





['Rate', 'Math', 'Verbal']



In [145]:

    
type(list_states)









    Out[145]:





list

11. Create a dictionary with the values for each of the numeric columns



In [149]:

    
sat_data['Math'][0:10]









    Out[149]:





[510, 513, 515, 505, 516, 499, 499, 506, 500, 501]



In [150]:

    
for i, j in sat_data.items():
    j = [float(x) for x in j]
    sat_data[i] = j



In [151]:

    
sat_data['Math'][0:10]









    Out[151]:





[510.0, 513.0, 515.0, 505.0, 516.0, 499.0, 499.0, 506.0, 500.0, 501.0]



In [152]:

    
sat_data.keys()









    Out[152]:





['Rate', 'Math', 'Verbal']



In [153]:

    
temp = []
dictlist = []



In [154]:

    
#convert dictionary to list
for key, value in sat_data.iteritems():
    temp = [key,value]
    dictlist.append(temp)



In [155]:

    
dictlist









    Out[155]:





[['Rate',
  [82.0,
   81.0,
   79.0,
   77.0,
   72.0,
   71.0,
   71.0,
   69.0,
   69.0,
   68.0,
   67.0,
   65.0,
   65.0,
   63.0,
   60.0,
   57.0,
   56.0,
   55.0,
   54.0,
   53.0,
   53.0,
   52.0,
   51.0,
   51.0,
   34.0,
   33.0,
   31.0,
   26.0,
   23.0,
   18.0,
   17.0,
   13.0,
   13.0,
   12.0,
   12.0,
   11.0,
   11.0,
   9.0,
   9.0,
   9.0,
   8.0,
   8.0,
   8.0,
   7.0,
   6.0,
   6.0,
   5.0,
   5.0,
   4.0,
   4.0,
   4.0,
   45.0]],
 ['Math',
  [510.0,
   513.0,
   515.0,
   505.0,
   516.0,
   499.0,
   499.0,
   506.0,
   500.0,
   501.0,
   499.0,
   510.0,
   499.0,
   489.0,
   501.0,
   488.0,
   474.0,
   526.0,
   499.0,
   527.0,
   499.0,
   515.0,
   510.0,
   517.0,
   525.0,
   515.0,
   542.0,
   439.0,
   539.0,
   512.0,
   542.0,
   553.0,
   542.0,
   589.0,
   550.0,
   545.0,
   572.0,
   589.0,
   580.0,
   554.0,
   568.0,
   561.0,
   577.0,
   562.0,
   596.0,
   550.0,
   570.0,
   603.0,
   582.0,
   599.0,
   551.0,
   514.0]],
 ['Verbal',
  [509.0,
   499.0,
   511.0,
   495.0,
   520.0,
   501.0,
   500.0,
   511.0,
   506.0,
   510.0,
   501.0,
   508.0,
   493.0,
   491.0,
   499.0,
   486.0,
   482.0,
   526.0,
   498.0,
   527.0,
   493.0,
   485.0,
   514.0,
   498.0,
   523.0,
   509.0,
   539.0,
   534.0,
   539.0,
   527.0,
   543.0,
   562.0,
   551.0,
   576.0,
   550.0,
   547.0,
   561.0,
   580.0,
   577.0,
   559.0,
   562.0,
   567.0,
   577.0,
   564.0,
   584.0,
   562.0,
   575.0,
   593.0,
   577.0,
   592.0,
   566.0,
   506.0]]]

Step 3: Describe the data

12. Print the min and max of each column



In [170]:

    
import pandas as pd
satscores = pd.read_csv('/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/sat_scores.csv')



In [171]:

    
satscores.head()



In [172]:

    
sat = pd.DataFrame(sat, columns=['State','Rate','Verbal','Math','Total_SAT'])



In [220]:

    
#Exclude the 'ALL' category from data
sats = sat.iloc[:51]



In [221]:

    
sat['Total_SAT'] = sat['Verbal'] + sat['Math']       #Included an aggregate version of SAT
sat[0:10]



In [222]:

    
print "Participation Rate Min:",sats["Rate"].min()









    



Participation Rate Min: 4



In [223]:

    
print "Participation Rate Max:",sats["Rate"].max()









    



Participation Rate Max: 82



In [224]:

    
print "SAT Math Min:",sats["Math"].min()









    



SAT Math Min: 439



In [225]:

    
print "SAT Math Max:",sats["Math"].max()









    



SAT Math Max: 603



In [226]:

    
print "SAT Verbal Min:",sat["Verbal"].min()









    



SAT Verbal Min: 482



In [227]:

    
print "SAT Verbal Max:",sats["Verbal"].max()









    



SAT Verbal Max: 593



In [228]:

    
print "Total SAT Min:",sat["Total_SAT"].min()









    



Total SAT Min: 956



In [229]:

    
print "Total SAT Max:",sats["Total_SAT"].max()









    



Total SAT Max: 1196



In [230]:

    
def summary_stats(col, data):
    print 'COLUMN: ' + col
    print 'mean: ' + str(np.mean(data))
    print 'median: ' + str(np.median(data))
    print 'mode: ' + str(stats.mode([round(d) for d in data]))
    print 'variance: ' + str(np.var(data))
    print 'standard deviation: ' + str(np.std(data))



In [231]:

    
summary_stats('Rate', sats['Rate'])









    



COLUMN: Rate
mean: 37.0
median: 33.0
mode: ModeResult(mode=array([ 4.]), count=array([3]))
variance: 744.156862745
standard deviation: 27.2792386761



In [232]:

    
summary_stats('Math', sats['Math'])









    



COLUMN: Math
mean: 531.843137255
median: 525.0
mode: ModeResult(mode=array([ 499.]), count=array([6]))
variance: 1290.95578624
standard deviation: 35.9298731731



In [233]:

    
summary_stats('Verbal', sats['Verbal'])









    



COLUMN: Verbal
mean: 532.529411765
median: 527.0
mode: ModeResult(mode=array([ 562.]), count=array([3]))
variance: 1091.11188005
standard deviation: 33.0319826842



In [234]:

    
summary_stats('Total_SAT', sats['Total_SAT'])









    



COLUMN: Total_SAT
mean: 1064.37254902
median: 1048.0
mode: ModeResult(mode=array([ 1000.]), count=array([5]))
variance: 4518.15532488
standard deviation: 67.217224912

13. Write a function using only list comprehensions, no loops, to compute Standard Deviation. Print the Standard Deviation of each numeric column.



In [259]:

    
def stddev(data):
    """returns the standard deviation of lst"""
    m = np.mean(data)
    variance = sum([(i - m)**2 for i in data]) / len(data)
    return np.sqrt(variance)



In [267]:

    
stddev(sats['Rate'])









    Out[267]:





27.279238676053591



In [268]:

    
stddev(sats['Math'])









    Out[268]:





35.929873173114082



In [269]:

    
stddev(sats['Verbal'])









    Out[269]:





33.031982684152283



In [270]:

    
stddev(sats['Total_SAT'])









    Out[270]:





67.217224912034638



In [278]:

    
#Hypothesis testing where
# H0 (null hypothesis): There is no difference between Math and Verbal SAT Scores
# HA (alternative hypothesis): There is a difference between Math and Verbal SAT Scores

a_mean = sats['Math'].mean()
b_mean = sats['Verbal'].mean()

a_var = sats['Math'].var()
b_var = sats['Verbal'].var()

a_n = len(sats['Math'])
b_n = len(sats['Verbal'])

numerator = a_mean - b_mean
denominator = np.sqrt((a_var / a_n) + (b_var / b_n))
z = numerator / denominator
z









    Out[278]:





-0.099427340585701379



In [279]:

    
p_val = 1 - stats.norm.cdf(z)
p_val









    Out[279]:





0.53960051216717997



In [282]:

    
alpha = .01
print p_val, alpha, p_val > alpha









    



0.539600512167 0.01 True

Step 4: Visualize the data

14. Using MatPlotLib and PyPlot, plot the distribution of the Rate using histograms.



In [235]:

    
ax = sns.distplot(sats['Rate'], bins=10)
sns.distplot(sats['Rate'], color='darkred', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution SAT Participation Rate')
plt.show()

15. Plot the Math distribution



In [236]:

    
ax = sns.distplot(sats['Math'], bins=10)
sns.distplot(sats['Math'], color='yellow', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution of Math SAT Scores')
plt.show()

16. Plot the Verbal distribution



In [237]:

    
ax = sns.distplot(sats['Verbal'], bins=10)
sns.distplot(sats['Verbal'], color='darkblue', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution of Verbal SAT Scores')
plt.show()



In [274]:

    
ax = sns.distplot(sats['Total_SAT'], bins=10)
sns.distplot(sats['Total_SAT'], color='darkblue', bins=10, ax=ax)

ax = plt.axes()
ax.set_title('Distribution of Total SAT Scores')
plt.show()

17. What is the typical assumption for data distribution?

A typical assumption of a data distribution is that the distribution is normal or the data is bell-curve shaped.

18. Does that distribution hold true for our data?

No, these numeric fields do not have a normal distribution. The SAT Verbal component is negatively skewed, whereas both Participation Rate and SAT Math are right skewed or positively skewed distribution.

19. Plot some scatterplots. BONUS: Use a PyPlot `figure` to present multiple plots at once.



In [271]:

    
import seaborn as sns
sns.pairplot(sats)
plt.show()

20. Are there any interesting relationships to note?

There seems to be a suggestive proportional relationship between SAT Math, SAT Verbal and Total scores, overall. That is, for example, as verbal scores increase, the math scores proportionally and positively increase. Other variable relationships, however, seem to have a rather inconclusive linear relationship. When considering building a linear regression model to describe Math, Verbal or Total Score one would need to address the outliers the scatter plots above display for each resective scores.

21. Create box plots for each variable.



In [239]:

    
data = [sats['Math'], sats['Verbal']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data)

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Box Plot of SAT Math / Verbal Scores', y =1.03, fontsize = 24)
ax1.set_xlabel('Features', fontsize = 18)
ax1.set_ylabel('SAT Scores', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 2
ax1.set_xlim(0.5, numBoxes + 0.5) 
ax1.set_ylim(400, 625)
xtickNames = plt.setp(ax1, xticklabels=['SAT Math Score', 'SAT Verbal Score'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(625, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()



In [272]:

    
data = [sats['Total_SAT']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data)

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Box Plot of Total SAT Scores', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('Combined SAT Scores', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 1
ax1.set_xlim(0.5, numBoxes + 0.5) 
ax1.set_ylim(900, 1300)
xtickNames = plt.setp(ax1, xticklabels=['Total SAT Scores'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(1300, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()



In [273]:

    
data = [sats['Rate']]

fig, ax1 = plt.subplots(figsize=(12, 8))

plt.boxplot(data)

ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
               alpha=0.5)

ax1.set_axisbelow(True)
ax1.set_title('Box Plot of Participation Rate in SAT Examination', y =1.03, fontsize = 24)
ax1.set_xlabel('Feature', fontsize = 18)
ax1.set_ylabel('Participation Rate', fontsize = 18)

# Set the axes ranges and axes labels
numBoxes = 1
ax1.set_xlim(0.5, numBoxes + 0.5) 
ax1.set_ylim(0, 100)
xtickNames = plt.setp(ax1, xticklabels=['Participation Rate'])
plt.setp(xtickNames, fontsize=14)

plt.axhline(100, color = 'darkgreen')
plt.axvline(1, color = 'darkgreen', linewidth = 1, alpha = 0.4)

plt.show()

BONUS: Using Tableau, create a heat map for each variable using a map of the US.



In [246]:

    
sat.to_csv("/Users/DES/DSI-NYC-5/Projects/project-1-sat-scores/assets/SAT_Scores_DC.csv", sep='\t')

	State	Rate	Verbal	Math	Total_SAT
0	CT	82	509	510	1019
1	NJ	81	499	513	1012
2	MA	79	511	515	1026
3	NY	77	495	505	1000
4	NH	72	520	516	1036
5	RI	71	501	499	1000
6	PA	71	500	499	999
7	VT	69	511	506	1017
8	ME	69	506	500	1006
9	VA	68	510	501	1011