Project 2

In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.

Step 1: Load the python libraries you will need for this project



In [19]:

    
#imports
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline

Step 2: Read in your data set



In [20]:

    
#Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
#print the first five rows to check that the file loaded as expected
print(df_raw.head())









    



   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

Questions

Question 1. How many observations are in our dataset?



In [21]:

    
#.shape returns (number of rows, number of columns)
rows, columns = df_raw.shape
print(rows)
print(columns)

Answer: 400 observations. The dataset has 400 rows (one per observation) and 4 columns (admit, gre, gpa, prestige).

Question 2. Create a summary table



In [22]:

    
#function
def summary_table(df):
    #returns a summary table for the given dataframe using .describe()
    return df.describe()



In [23]:

    
df_raw.describe()

Question 3. Why would GRE have a larger STD than GPA?

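Answer: Standard deviation is measured in the same units as the variable, and GRE is recorded on a much wider scale than GPA: GRE scores range from 220 to 800, while GPA ranges from 2.26 to 4.00. The larger 'std' value for GRE therefore mostly reflects the difference in scale.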
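One way to see how much of the gap comes from units is to divide each standard deviation by its mean (the coefficient of variation). The sketch below is only an illustration: it uses the mean and std values reported by df_raw.describe(), and the helper name coefficient_of_variation is made up for this example.

```python
# Values copied from the df_raw.describe() summary table
gre_mean, gre_std = 588.040201, 115.628513
gpa_mean, gpa_std = 3.39093, 0.38063

def coefficient_of_variation(std, mean):
    # hypothetical helper: std expressed as a fraction of the mean (unit-free)
    return std / float(mean)

gre_cv = coefficient_of_variation(gre_std, gre_mean)
gpa_cv = coefficient_of_variation(gpa_std, gpa_mean)

print(round(gre_std / gpa_std))  # ratio of the raw standard deviations
print(round(gre_cv, 3))          # GRE spread relative to its mean
print(round(gpa_cv, 3))          # GPA spread relative to its mean
```

On this unit-free scale GRE is still somewhat more spread out than GPA (about 0.20 versus about 0.11), but the difference is far smaller than the roughly 300-fold gap between the raw std values.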

Question 4. Drop data points with missing data



In [24]:

    
df_raw.dropna() 
#drops any missing data rows from admissions.csv dataset 
#returns 397 observations (complete observation rows) across 4 columns
#3 rows had missing, incomplete, NaN data present









    Out[24]:

     admit  gre   gpa  prestige
0        0  380  3.61         3
1        1  660  3.67         3
2        1  800  4.00         1
3        1  640  3.19         4
4        0  520  2.93         4
5        1  760  3.00         2
6        1  560  2.98         1
7        0  400  3.08         2
8        1  540  3.39         3
9        0  700  3.92         2
10       0  800  4.00         4
11       0  440  3.22         1
12       1  760  4.00         1
13       0  700  3.08         2
14       1  700  4.00         1
15       0  480  3.44         3
16       0  780  3.87         4
17       0  360  2.56         3
18       0  800  3.75         2
19       1  540  3.81         1
20       0  500  3.17         3
21       1  660  3.63         2
22       0  600  2.82         4
23       0  680  3.19         4
24       1  760  3.35         2
25       1  800  3.66         1
26       1  620  3.61         1
27       1  520  3.74         4
28       1  780  3.22         2
29       0  520  3.29         1
...    ...  ...   ...       ...
370      1  540  3.77         2
371      1  680  3.76         3
372      1  680  2.42         1
373      1  620  3.37         1
374      0  560  3.78         2
375      0  560  3.49         4
376      0  620  3.63         2
377      1  800  4.00         2
378      0  640  3.12         3
379      0  540  2.70         2
380      0  700  3.65         2
381      1  540  3.49         2
382      0  540  3.51         2
383      0  660  4.00         1
384      1  480  2.62         2
385      0  420  3.02         1
386      1  740  3.86         2
387      0  580  3.36         2
388      0  640  3.17         2
389      0  640  3.51         2
390      1  800  3.05         2
391      1  660  3.88         2
392      1  600  3.38         3
393      1  620  3.75         2
394      1  460  3.99         3
395      0  620  4.00         2
396      0  560  3.04         3
397      0  460  2.63         2
398      0  700  3.65         2
399      0  600  3.89         3

397 rows × 4 columns

Question 5. Confirm that you dropped the correct data. How can you tell?

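Answer: The code in Question 1 returned 400 observations across 4 columns. The dataframe returned by the '.dropna()' method has 397 rows, so three rows were removed because each contained at least one NaN value. This is consistent with the summary table, where 'gre' and 'gpa' each have 398 non-missing values and 'prestige' has 399.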
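A more direct check is to count the rows that contain a NaN and compare that number with the number of rows removed. The sketch below is a small illustration on a made-up dataframe named toy (not the real admissions data); the same calls could be pointed at df_raw.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw: made-up values, same column names
toy = pd.DataFrame({
    'admit':    [0, 1, 1, 0, 1],
    'gre':      [380, np.nan, 800, 640, np.nan],
    'gpa':      [3.61, 3.67, np.nan, 3.19, np.nan],
    'prestige': [3, 3, 1, 4, 2],
})

missing_per_column = toy.isnull().sum()             # NaN count in each column
rows_with_missing = toy.isnull().any(axis=1).sum()  # rows holding at least one NaN

toy_clean = toy.dropna()  # returns a new dataframe; toy itself is unchanged

print(missing_per_column)
print(rows_with_missing)
print(len(toy) - len(toy_clean))  # rows removed by dropna()
```

If the number of rows with a missing value equals the number of rows removed, and the cleaned frame has no NaN left, the correct rows were dropped. Note that one row can hold several missing values, so the per-column counts can add up to more than the number of dropped rows.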

Question 6. Create box plots for GRE and GPA



In [25]:

    
#boxplot for GRE column data
df_raw.boxplot(column = 'gre', return_type = 'axes')









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x11a27be50>



In [26]:

    
#boxplot for GPA column data
df_raw.boxplot(column = 'gpa', return_type = 'axes')









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x11a7abcd0>

Question 7. What do these plots show?

Answer: GRE Boxplot: The line inside the box marks the median (not the mean), which lies just south of 600 (around 580). The box spans the interquartile range, from about 520 at the lower quartile to about 660 at the upper quartile. The box plot displays an outlier near 300, below the lower whisker. It is drawn as a separate point because it lies more than 1.5 times the interquartile range below the lower quartile, which is the default whisker rule, not because of its distance from the mean in standard deviations.

GPA Boxplot: The median GPA falls right at ~3.40, with the interquartile range running from ~3.13 at the lower quartile to ~3.67 at the upper quartile. The lower whisker ends at about 2.4. The upper whisker ends at 4.00, the maximum of this data, since a whisker only extends as far as the most extreme observation.
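The outlier rule can be checked with a little arithmetic. The sketch below assumes the default whisker length of 1.5 times the interquartile range and uses the 25% and 75% values from the df_raw.describe() summary table; whisker_fences is a hypothetical helper written for this example.

```python
def whisker_fences(q1, q3, whis=1.5):
    # hypothetical helper: limits beyond which points are drawn as outliers
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Quartiles copied from the df_raw.describe() summary table
gre_low, gre_high = whisker_fences(520.0, 660.0)
gpa_low, gpa_high = whisker_fences(3.13, 3.67)

print(gre_low, gre_high)
print(round(gpa_low, 2), round(gpa_high, 2))
```

Under these assumptions the minimum GRE score (220) and the minimum GPA (2.26) both fall below their lower limits, so they would be plotted as outliers, while the upper limits lie above the maximum values of 800 and 4.00.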

Question 8. Describe each distribution



In [63]:

    
# distribution plot of 'admit' variable with mean
df_raw.admit.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4));plt.legend(loc='best')
# 

plt.vlines(df_raw.admit.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=2.0,
           linewidth=4.0)









    Out[63]:





<matplotlib.collections.LineCollection at 0x125453510>



In [64]:

    
# distribution plot of 'gre' variable with mean
df_raw.gre.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4));plt.legend(loc='best')
# 

plt.vlines(df_raw.gre.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.0035,
           linewidth=4.0)









    Out[64]:





<matplotlib.collections.LineCollection at 0x1255c0f90>



In [65]:

    
# distribution plot of 'gpa' variable with mean
df_raw.gpa.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4));plt.legend(loc='best')
# 

plt.vlines(df_raw.gpa.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=1.0,
           linewidth=4.0)









    Out[65]:





<matplotlib.collections.LineCollection at 0x125113750>



In [66]:

    
# distribution plot of 'prestige' variable with mean
df_raw.prestige.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4));plt.legend(loc='best')
# 

plt.vlines(df_raw.prestige.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.6,
           linewidth=4.0)









    Out[66]:





<matplotlib.collections.LineCollection at 0x126296090>

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

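Answer: We would not meet that requirement, as only the variable 'gre' is approximately normally distributed. The variables 'admit', 'gpa', and 'prestige' are not normally distributed: 'admit' is binary (0 or 1), 'prestige' takes only four discrete values (1 to 4), and 'gpa' is capped at 4.00.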
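The visual impression from the density plots can be backed up with simple numbers such as sample skewness and the count of distinct values. The sketch below runs on simulated data (not the admissions file); the column names bell and binary are made up, and the simulation parameters are only loosely modelled on 'gre' and 'admit'.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the example is repeatable
toy = pd.DataFrame({
    'bell':   rng.normal(loc=580, scale=115, size=400),  # roughly normal values
    'binary': rng.binomial(1, 0.3, size=400),            # 0/1 values, like 'admit'
})

print(toy.skew())     # skewness near 0 suggests a symmetric distribution
print(toy.nunique())  # a column with very few distinct values cannot be normal
```

Running df_raw.skew() and df_raw.nunique() would give the same kind of summary for the real variables.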

Question 10. Does this distribution need correction? If so, why? How?

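Answer: Yes, this data does need to be corrected. If we are to compare these variables through linear regression or other inferential statistical tools, the variables should first be put on a comparable scale. We can accomplish this by creating a new dataframe like so: df_norm = (df_raw - df_raw.mean()) / (df_raw.max() - df_raw.min()). Note that this rescaling centers each column and divides by its range, but it does not change the shape of a distribution; a skewed variable would also need a transformation such as a log transform to look more normal.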

Sourced solution for normalization: http://stackoverflow.com/questions/12525722/normalize-data-in-pandas
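As a minimal sketch of that normalization, the code below applies the same formula to a small dataframe built from the 'gre' and 'gpa' values of the first five rows shown in Step 2. The names toy and toy_norm are made up for this example.

```python
import pandas as pd

# 'gre' and 'gpa' values from the first five rows of the dataset
toy = pd.DataFrame({'gre': [380.0, 660.0, 800.0, 640.0, 520.0],
                    'gpa': [3.61, 3.67, 4.00, 3.19, 2.93]})

# Mean normalization: center each column, then divide by its range
toy_norm = (toy - toy.mean()) / (toy.max() - toy.min())

print(toy_norm)
print(toy_norm.max() - toy_norm.min())  # each column now spans a range of 1
print(toy.skew())                       # skewness before rescaling
print(toy_norm.skew())                  # skewness after rescaling
```

After rescaling, each column has mean 0 and a range of 1, so 'gre' and 'gpa' are directly comparable, while the skewness of each column is the same before and after.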

Question 11. Which of our variables are potentially colinear?



In [67]:

    
# correlation matrix for variables in df_raw
df_raw.corr()

Question 12. What did you find?

Answer: The most interesting correlation is the one between the variables 'admit' and 'prestige'. The two variables are negatively correlated (-0.241), which implies that as the prestige value of your school increases, your likelihood of admission to UCLA tends to decrease. A correlation coefficient describes the direction and strength of a linear association; it is not a slope, so it does not mean that the likelihood of admission falls by 24% for each one-unit increase in prestige. The largest correlation in the matrix is between 'gpa' and 'gre' (0.382408), which makes them the most likely candidates for collinearity, although the association is only moderate: applicants with higher GPAs tend to have higher GRE scores.
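The difference between a correlation coefficient and a per-unit effect can be illustrated with made-up numbers. In the sketch below, x and y are hypothetical values; the correlation stays the same when y is rescaled, while the slope of the fitted line changes with the units.

```python
import numpy as np

# Made-up values, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient
slope = np.polyfit(x, y, 1)[0]  # change in y per one-unit change in x

# Rescale y (for example, a change of units): r is unchanged, the slope is not
r_scaled = np.corrcoef(x, 100 * y)[0, 1]
slope_scaled = np.polyfit(x, 100 * y, 1)[0]

print(r, r_scaled)
print(slope, slope_scaled)
```

A per-unit statement about admission would therefore have to come from a fitted model, not from the correlation matrix.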

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.

Answer: I will examine the relationship between the variables 'prestige' and 'admit' in the admissions.csv dataset in order to determine whether the two variables are correlated and whether they are causally linked. Further, I will determine whether this relationship is statistically significant.
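A natural first step in such a plan is to compare the admission rate across prestige levels. The sketch below shows one possible way to do this with pandas on a made-up dataframe called toy; the values are invented and only the column names match the dataset.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    'admit':    [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
    'prestige': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})

# Share of admitted applicants within each prestige level
admit_rate = toy.groupby('prestige')['admit'].mean()

# Counts of admitted / not admitted applicants per prestige level
counts = pd.crosstab(toy['prestige'], toy['admit'])

print(admit_rate)
print(counts)
```

Applied to the cleaned admissions data, the same two calls would give the admission rate and the contingency table needed for a significance test.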

Question 14. What is your hypothesis?

Answer: H1 = There exists a statistically significant relationship between undergraduate school prestige ('prestige') and admission ('admit'). H0 = There is no relationship between undergraduate school prestige ('prestige') and admission ('admit').
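One possible way to test these hypotheses is a chi-square test of independence on the table of prestige level by admission outcome. This choice of test is an assumption, not something fixed by the project, and the counts below are made up purely to show the arithmetic.

```python
# Made-up counts, for illustration only: one row per prestige level (1 to 4),
# columns are (not admitted, admitted)
observed = [[10, 20], [30, 20], [40, 10], [20, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = float(sum(row_totals))

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # count expected in this cell if prestige and admission were independent
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 7.815  # chi-square critical value for 3 degrees of freedom, alpha = 0.05
reject_h0 = chi2 > critical_value

print(chi2, dof, reject_h0)
```

H0 is rejected when the statistic exceeds the critical value. With the real data, the observed counts would come from a cross-tabulation of 'prestige' and 'admit'.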

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

3. Advanced: Impute missing data
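For the log transform in item 2, a minimal sketch is shown here on a made-up, right-skewed series (the values are invented, not taken from the admissions data). A log transform only applies to strictly positive values and is mainly useful when the long tail is on the right.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values, for illustration only
skewed = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 20.0, 60.0])

logged = np.log(skewed)  # natural log compresses the long right tail

print(skewed.skew())
print(logged.skew())
```

The skewness of the logged series is closer to zero than that of the original, which is the effect the transform is meant to have.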



In [75]:

    
#utilized this stackoverflow.com resource to attempt to impute missing data
#(http://stackoverflow.com/questions/21050426/pandas-impute-nans)

#data imputation for variable 'admit'

#the first, commented-out line of code will not run: df_raw has no column named 'keys' to group by
#df_raw['admit'].fillna(df_raw.groupby('keys')['admit'].transform('mean'), inplace = True)

#fill any missing 'admit' values with the column mean
#(the summary table shows 400 non-missing 'admit' values, so this leaves the column unchanged)
df_raw['admit'] = df_raw['admit'].fillna(df_raw['admit'].mean())
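According to the summary table, the missing values are in 'gre', 'gpa', and 'prestige', so those are the columns where imputation would have an effect. The sketch below shows one possible approach on a made-up dataframe named toy: the median for the continuous columns and the most frequent value for 'prestige'. These choices are assumptions for illustration, not a recommendation from the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw: made-up missing values in three columns
toy = pd.DataFrame({
    'gre':      [380.0, np.nan, 800.0, 640.0, 520.0],
    'gpa':      [3.61, 3.67, np.nan, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, np.nan, 4.0],
})

imputed = toy.copy()  # keep the original frame untouched

# Continuous columns: fill with the column median
for col in ['gre', 'gpa']:
    imputed[col] = imputed[col].fillna(imputed[col].median())

# Ordinal column: fill with the most frequent value (the mode)
imputed['prestige'] = imputed['prestige'].fillna(imputed['prestige'].mode()[0])

print(imputed)
```

Unlike dropna(), this keeps every row, at the cost of replacing unknown values with typical ones.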



In [ ]:

Summary table (output of df_raw.describe(), In [23]):

            admit         gre        gpa    prestige
count  400.000000  398.000000  398.00000  399.000000
mean     0.317500  588.040201    3.39093    2.486216
std      0.466087  115.628513    0.38063    0.945333
min      0.000000  220.000000    2.26000    1.000000
25%      0.000000  520.000000    3.13000    2.000000
50%      0.000000  580.000000    3.39500    2.000000
75%      1.000000  660.000000    3.67000    3.000000
max      1.000000  800.000000    4.00000    4.000000

Correlation matrix (output of df_raw.corr(), In [67]):

             admit       gre       gpa  prestige
admit     1.000000  0.182919  0.175952 -0.241355
gre       0.182919  1.000000  0.382408 -0.124533
gpa       0.175952  0.382408  1.000000 -0.059031
prestige -0.241355 -0.124533 -0.059031  1.000000
