Project 2

In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.

Step 1: Load the Python libraries you will need for this project


In [19]:
#imports
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline

Step 2: Read in your data set


In [20]:
#Read in data from source 
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())


   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

Questions

Question 1. How many observations are in our dataset?


In [21]:
#df_raw.shape returns (number of rows, number of columns)
rows, columns = df_raw.shape
print(rows)
print(columns)


400
4

Answer: 400 observations. Each observation is one row of the dataset, recorded across 4 columns (admit, gre, gpa, prestige).

Question 2. Create a summary table


In [22]:
#function
def summary_table(df):
    #returns a summary table for the given dataframe using .describe()
    return df.describe()

In [23]:
df_raw.describe()


Out[23]:
            admit         gre        gpa    prestige
count  400.000000  398.000000  398.00000  399.000000
mean     0.317500  588.040201    3.39093    2.486216
std      0.466087  115.628513    0.38063    0.945333
min      0.000000  220.000000    2.26000    1.000000
25%      0.000000  520.000000    3.13000    2.000000
50%      0.000000  580.000000    3.39500    2.000000
75%      1.000000  660.000000    3.67000    3.000000
max      1.000000  800.000000    4.00000    4.000000

Question 3. Why would GRE have a larger STD than GPA?

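Answer: Standard deviation is expressed in the units of the variable it describes. GRE scores range from 220 to 800, while GPA ranges only from 2.26 to 4.00, so the GRE variable has a much larger 'std' value (about 115.6 versus 0.38) simply because it is measured on a much larger scale.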
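
One way to compare spread across variables measured on different scales is the coefficient of variation (std divided by mean). The sketch below is illustrative only: it uses just the five rows printed by df_raw.head() above rather than the full file, and the names `sample`, `std`, and `cv` are assumptions introduced here.

```python
import pandas as pd

# The five rows printed by df_raw.head() above; NOT the full dataset.
sample = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# Standard deviation is expressed in the units of each variable,
# so GRE (hundreds of points) dwarfs GPA (a 4-point scale).
std = sample.std()

# Dividing by the mean gives a unitless measure of relative spread.
cv = std / sample.mean()

print(std)
print(cv)
```

Applying the same ratio to the summary table gives about 115.63 / 588.04 ≈ 0.20 for GRE and 0.38 / 3.39 ≈ 0.11 for GPA, so GRE is still more spread out in relative terms, but far less dramatically than the raw std values suggest.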

Question 4. Drop data points with missing data


In [24]:
#dropna() returns a new dataframe and leaves df_raw unchanged, so the result is stored in df
#drops any row with missing data from the admissions.csv dataset
#returns 397 complete observations (rows) across 4 columns
#3 rows contained at least one missing (NaN) value
df = df_raw.dropna()
df


Out[24]:
     admit  gre   gpa  prestige
0        0  380  3.61         3
1        1  660  3.67         3
2        1  800  4.00         1
3        1  640  3.19         4
4        0  520  2.93         4
5        1  760  3.00         2
6        1  560  2.98         1
7        0  400  3.08         2
8        1  540  3.39         3
9        0  700  3.92         2
10       0  800  4.00         4
11       0  440  3.22         1
12       1  760  4.00         1
13       0  700  3.08         2
14       1  700  4.00         1
15       0  480  3.44         3
16       0  780  3.87         4
17       0  360  2.56         3
18       0  800  3.75         2
19       1  540  3.81         1
20       0  500  3.17         3
21       1  660  3.63         2
22       0  600  2.82         4
23       0  680  3.19         4
24       1  760  3.35         2
25       1  800  3.66         1
26       1  620  3.61         1
27       1  520  3.74         4
28       1  780  3.22         2
29       0  520  3.29         1
...    ...  ...   ...       ...
370      1  540  3.77         2
371      1  680  3.76         3
372      1  680  2.42         1
373      1  620  3.37         1
374      0  560  3.78         2
375      0  560  3.49         4
376      0  620  3.63         2
377      1  800  4.00         2
378      0  640  3.12         3
379      0  540  2.70         2
380      0  700  3.65         2
381      1  540  3.49         2
382      0  540  3.51         2
383      0  660  4.00         1
384      1  480  2.62         2
385      0  420  3.02         1
386      1  740  3.86         2
387      0  580  3.36         2
388      0  640  3.17         2
389      0  640  3.51         2
390      1  800  3.05         2
391      1  660  3.88         2
392      1  600  3.38         3
393      1  620  3.75         2
394      1  460  3.99         3
395      0  620  4.00         2
396      0  560  3.04         3
397      0  460  2.63         2
398      0  700  3.65         2
399      0  600  3.89         3

397 rows × 4 columns

Question 5. Confirm that you dropped the correct data. How can you tell?

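Answer: The code in Question 1 returned 400 observations (rows) across 4 columns. The data culled with the '.dropna()' method contains 397 rows, implying that three rows were removed because they contained NaN data. This agrees with the summary table, where 'gre', 'gpa', and 'prestige' each have fewer than 400 non-missing values.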
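
The row count alone does not prove that the right rows were removed. A more direct check, sketched below, counts the missing values per column before the drop and confirms that none remain afterwards. The frame `toy` and its values are hypothetical stand-ins for df_raw, made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_raw; the values are made up.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 0],
    "gre": [380, np.nan, 800, 640],
    "gpa": [3.61, 3.67, np.nan, 3.19],
    "prestige": [3, 3, 1, 4],
})

# Count missing values per column before dropping.
missing_per_column = toy.isnull().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = toy.isnull().any(axis=1).sum()

cleaned = toy.dropna()

# The row count should fall by exactly the number of incomplete rows,
# and no NaN should remain in the cleaned frame.
print(missing_per_column)
print(len(toy) - len(cleaned) == rows_with_missing)
print(cleaned.isnull().sum().sum() == 0)
```

Run against the real data, the same three checks would use df_raw in place of `toy`.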

Question 6. Create box plots for GRE and GPA


In [25]:
#boxplot for GRE column data
df_raw.boxplot(column = 'gre', return_type = 'axes')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a27be50>

In [26]:
#boxplot for GPA column data
df_raw.boxplot(column = 'gpa', return_type = 'axes')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a7abcd0>

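Question 7. What do these plots show?

Answer: GRE Boxplot: The line inside the box marks the median (not the mean) of this variable, which lies just south of 600 (around 580). The box itself spans the interquartile range, roughly 520 to 660 according to the 25% and 75% rows of the summary table. The box plot displays a low outlier at about 300. It is drawn as an individual point because it falls more than 1.5 times the interquartile range below the lower quartile, which is the default box plot rule, not because of its distance from the mean in standard deviations. With quartiles of 520 and 660 the cutoff is 520 - 1.5 × 140 = 310, so this value lies below the lower whisker of the GRE plot.

GPA Boxplot: The median GPA value falls right at ~3.40, with the interquartile range running from ~3.13 at the lower quartile to ~3.67 at the upper quartile (per the summary table). The lower whisker of this data ends right at about 2.4. The upper whisker ends at 4.00, the maximum of the data: a whisker is drawn to the most extreme observation within 1.5 times the interquartile range of the box, so it cannot extend beyond the largest observed value.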
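
The quantities a box plot draws can also be computed directly, which avoids reading them off the figure. This is a minimal sketch assuming the default 1.5 × IQR whisker rule; it uses only the five GRE values printed by df_raw.head(), and all variable names are introduced here for illustration.

```python
import pandas as pd

# GRE values from the five rows printed by df_raw.head(); NOT the full data.
gre = pd.Series([380, 660, 800, 640, 520])

q1 = gre.quantile(0.25)
median = gre.quantile(0.50)
q3 = gre.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule: points beyond them are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that lie inside the fences.
inside = gre[(gre >= lower_fence) & (gre <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

outliers = gre[(gre < lower_fence) | (gre > upper_fence)]

print(q1, median, q3, iqr)
print(lower_whisker, upper_whisker)
print(outliers.tolist())
```

Replacing `gre` with df_raw['gre'].dropna() or df_raw['gpa'].dropna() would give the corresponding numbers for the full dataset.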

Question 8. Describe each distribution


In [63]:
# distribution plot of 'admit' variable with mean
df_raw.admit.plot(kind='density', sharex=False, sharey=False, figsize=(10, 4))
plt.legend(loc='best')

# plot black line at mean
plt.vlines(df_raw.admit.mean(),
           ymin=0,
           ymax=2.0,
           linewidth=4.0)


Out[63]:
<matplotlib.collections.LineCollection at 0x125453510>

In [64]:
# distribution plot of 'gre' variable with mean
df_raw.gre.plot(kind='density', sharex=False, sharey=False, figsize=(10, 4))
plt.legend(loc='best')

# plot black line at mean
plt.vlines(df_raw.gre.mean(),
           ymin=0,
           ymax=0.0035,
           linewidth=4.0)


Out[64]:
<matplotlib.collections.LineCollection at 0x1255c0f90>

In [65]:
# distribution plot of 'gpa' variable with mean
df_raw.gpa.plot(kind='density', sharex=False, sharey=False, figsize=(10, 4))
plt.legend(loc='best')

# plot black line at mean
plt.vlines(df_raw.gpa.mean(),
           ymin=0,
           ymax=1.0,
           linewidth=4.0)


Out[65]:
<matplotlib.collections.LineCollection at 0x125113750>

In [66]:
# distribution plot of 'prestige' variable with mean
df_raw.prestige.plot(kind='density', sharex=False, sharey=False, figsize=(10, 4))
plt.legend(loc='best')

# plot black line at mean
plt.vlines(df_raw.prestige.mean(),
           ymin=0,
           ymax=0.6,
           linewidth=4.0)


Out[66]:
<matplotlib.collections.LineCollection at 0x126296090>

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

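Answer: We would not meet that requirement, as only the variable 'gre' displays a quasi-normal distribution. The variable 'admit' is binary (0 or 1) and 'prestige' takes only four discrete values (1 to 4), so neither can be normally distributed, and the density of 'gpa' also departs from a normal shape.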

Question 10. Does this distribution need correction? If so, why? How?

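Answer: Yes, this data does need to be corrected. If we are to compare these variables through linear regression or other inferential statistical tools, a common first step is to normalize the data so that the variables are on a comparable scale. We can accomplish this by creating a new dataframe like so: df_norm = (df_raw - df_raw.mean()) / (df_raw.max() - df_raw.min()). Note, however, that this is a linear rescaling: it changes the center and range of each column but not the shape of its distribution, so a skewed variable stays skewed. Bringing a skewed variable closer to a normal distribution requires a nonlinear transformation, such as the log transform listed in the bonus section.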

Sourced solution for normalization: http://stackoverflow.com/questions/12525722/normalize-data-in-pandas
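
The effect of this normalization can be checked on a small example. The sketch below applies the formula to the five rows printed by df_raw.head() (not the full file) and compares skewness before and after. The names `sample` and `sample_norm` are assumptions introduced here.

```python
import pandas as pd

# The five rows printed by df_raw.head() above; NOT the full dataset.
sample = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# Mean normalization: center each column, then divide by its range.
sample_norm = (sample - sample.mean()) / (sample.max() - sample.min())

# After normalization every column has mean 0 and a range of exactly 1.
print(sample_norm.mean())
print(sample_norm.max() - sample_norm.min())

# Skewness measures the asymmetry of a distribution. A linear rescaling
# leaves it unchanged, so normalization does not make data more normal.
skew_before = sample.skew()
skew_after = sample_norm.skew()
print(skew_before)
print(skew_after)
```

The two skewness printouts match, which is why rescaling alone does not address a normality assumption.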

Question 11. Which of our variables are potentially collinear?


In [67]:
# correlation matrix for variables in df_raw
df_raw.corr()


Out[67]:
             admit       gre       gpa  prestige
admit     1.000000  0.182919  0.175952 -0.241355
gre       0.182919  1.000000  0.382408 -0.124533
gpa       0.175952  0.382408  1.000000 -0.059031
prestige -0.241355 -0.124533 -0.059031  1.000000

Question 12. What did you find?

Answer: The most interesting correlation is the one between the variables 'admit' and 'prestige'. The two variables are negatively correlated (-0.241), meaning that higher values of 'prestige' tend to go with a lower likelihood of admission to UCLA. A correlation coefficient is a unitless measure of linear association, not a slope, so it does not say that admission likelihood falls by 0.241 (or 24%) per unit of prestige, and it does not hold the other variables constant. The largest correlation in the table is between GPA and GRE (0.382408): students with higher GPAs tend to have higher GRE scores. That makes these two the most likely candidates for collinearity, although a correlation of this size is only moderate.
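
Because a correlation coefficient is hard to read as a change in admission chances, it can help to look at the admission rate within each prestige level. The sketch below shows both quantities side by side. The frame `toy` is hypothetical data made up for illustration, not values from admissions.csv.

```python
import pandas as pd

# HYPOTHETICAL data, invented for this sketch (12 applicants).
toy = pd.DataFrame({
    "prestige": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "admit":    [1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Pearson correlation: a single unitless number between -1 and 1.
r = toy["admit"].corr(toy["prestige"])

# Mean of a 0/1 column is the share of 1s, i.e. the admission rate per group.
rate_by_prestige = toy.groupby("prestige")["admit"].mean()

print(r)
print(rate_by_prestige)
```

On the real data the same two lines would use df_raw (or the frame with missing rows dropped) in place of `toy`.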

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.

Answer: I will examine the relationship between the variables 'prestige' and 'admit' using the admissions.csv dataset in order to determine whether the two variables are associated and whether that association is statistically significant. Because the data are observational, an association alone cannot establish that the two are causally linked, so any causal interpretation will be treated with caution.

Question 14. What is your hypothesis?

Answer: H1 = There exists a statistically significant relationship between undergraduate school prestige ('prestige') and admission ('admit'). H0 = There is no relationship between undergraduate school prestige ('prestige') and admission ('admit').
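
One possible way to test these hypotheses is a chi-square test of independence on the prestige-by-admission counts. The sketch below computes the statistic by hand with numpy. It is an assumption about how the test could be carried out, not a method prescribed by the project, and the counts in `observed` are hypothetical numbers invented for illustration.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL counts of applicants by prestige level; not from admissions.csv.
observed = pd.DataFrame(
    {"rejected": [30, 90, 100, 60], "admitted": [30, 50, 25, 10]},
    index=pd.Index([1, 2, 3, 4], name="prestige"),
)

row_totals = observed.sum(axis=1).values
col_totals = observed.sum(axis=0).values
n = observed.values.sum()

# Counts expected under H0 (admission independent of prestige).
expected = np.outer(row_totals, col_totals) / float(n)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = ((observed.values - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1) = 3 for a 4 x 2 table.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value of the chi-square distribution for 3 degrees of freedom
# at the 0.05 significance level.
critical_value = 7.815
reject_h0 = chi2 > critical_value

print(chi2, dof, reject_h0)
```

With the real data, the observed table could be built with pd.crosstab on the 'prestige' and 'admit' columns. If scipy is available, scipy.stats.chi2_contingency performs the same calculation and also returns a p-value.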

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

3. Advanced: Impute missing data


In [75]:
#utilized this stackoverflow.com resource to attempt to impute missing data 
#(http://stackoverflow.com/questions/21050426/pandas-impute-nans)

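#data imputation for variable 'admit'
#note: per the summary table, 'admit' has 400 non-missing values, so this fill changes nothing;
#the missing values are in 'gre', 'gpa', and 'prestige'

#the commented out line below will not run: 'keys' is not a column in df_raw,
#so df_raw.groupby('keys') raises a KeyError
#df_raw['admit'].fillna(df_raw.groupby('keys')['admit'].transform('mean'), inplace = True)
df_raw['admit'] = df_raw['admit'].fillna(df_raw['admit'].mean())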
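
Since the missing values sit in 'gre', 'gpa', and 'prestige', an imputation would need to target those columns. The sketch below shows one simple approach, offered as an assumption rather than the required solution: fill the continuous columns with their median and the ordinal 'prestige' column with its most frequent value. The frame `toy` is hypothetical data made up for illustration.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL frame standing in for df_raw; the values are made up.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 0, 1],
    "gre": [380, np.nan, 800, 640, 520],
    "gpa": [3.61, 3.67, np.nan, 3.19, 2.93],
    "prestige": [3, 3, 1, np.nan, 3],
})

# Work on a copy so the original frame is left untouched.
imputed = toy.copy()

# Continuous variables: fill with the column median, which is less
# sensitive to outliers than the mean.
for col in ["gre", "gpa"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())

# Ordinal variable: fill with the most frequent level, since a mean
# such as 2.5 is not a valid prestige value.
most_common = imputed["prestige"].mode().iloc[0]
imputed["prestige"] = imputed["prestige"].fillna(most_common)

print(imputed)
print(imputed.isnull().sum())
```

Filling with a single value keeps every row but understates the variability of the imputed columns, so results are worth comparing against the version with incomplete rows dropped.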
