In [19]:
#imports
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline
In [20]:
#Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())
In [21]:
df_raw['admit'].count()
df_raw['gpa'].count()
df_raw.shape
rows,columns = df_raw.shape
print(rows)
print(columns)
Answer: 400 observations. The data consist of 400 rows (one per observation) across 4 columns (admit, gre, gpa, prestige).
In [22]:
#function
def summary_table(df):
    #creates a summary table for the given DataFrame using .describe()
    return df.describe()
print(summary_table(df_raw))
In [23]:
df_raw.describe()
Out[23]:
Answer: The GRE variable has a larger 'std' value because GRE scores are measured on a much wider scale, ranging from 220 to 800, while GPA ranges only from 2.26 to 4.00.
In [24]:
df_raw.dropna()
#drops any rows with missing data from the admissions.csv dataset
#returns 397 observations (complete rows) across 4 columns
#3 rows contained missing (NaN) data
#note: .dropna() returns a new DataFrame; df_raw itself is left unchanged
Out[24]:
Answer: The code in question one returned 400 observations across 4 columns. Culling the data with the '.dropna()' method returns 397 rows, implying that three rows were removed because they contained NaN data.
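The gap between the row count from .shape and the row count after .dropna() can be checked directly. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and only stand in for df_raw.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_raw: 5 rows, 2 of which contain a NaN
toy = pd.DataFrame({
    'admit': [0, 1, np.nan, 1, 0],
    'gre': [380, 660, 800, np.nan, 520],
})

missing_per_column = toy.isnull().sum()   # NaN count in each column
rows_before = toy.shape[0]                # counts every row, NaN or not
rows_after = toy.dropna().shape[0]        # rows with no NaN in any column

print(missing_per_column)
print(rows_before - rows_after)           # number of rows that would be dropped
```

Because .dropna() returns a copy, `toy` still has all of its rows afterwards; the result has to be assigned to a name to be kept.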
In [25]:
#boxplot for GRE column data
df_raw.boxplot(column = 'gre', return_type = 'axes')
Out[25]:
In [26]:
#boxplot for GPA column data
df_raw.boxplot(column = 'gpa', return_type = 'axes')
Out[26]:
Answer: GRE Boxplot: The median for this variable (the line inside the box; a box plot shows the median, not the mean) lies just south of 600 (around 580), and the interquartile range runs from about 510 to about 650, as indicated by the blue box. The box plot displays an outlier at around 300. It is drawn as a separate point because it falls more than 1.5 times the interquartile range below the lower quartile, which places it below the lower whisker of the GRE plot.
GPA Boxplot: The median GPA value falls right at ~3.40, with the interquartile range running from ~3.18 at the lower quartile to ~3.64 at the upper quartile. The lower whisker ends at about 2.4. The upper whisker ends at the data maximum of 4.00; a whisker stops at the most extreme observation within 1.5 times the interquartile range of the box, so it cannot extend past the largest value in the data.
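The quantities a box plot draws can also be computed by hand. The sketch below assumes the default matplotlib rule of whiskers at 1.5 times the interquartile range; the `scores` values are made up for illustration and are not taken from admissions.csv.

```python
import pandas as pd

# Hypothetical scores; 300 is deliberately far below the rest
scores = pd.Series([300, 520, 540, 560, 580, 600, 620, 640, 660])

q1 = scores.quantile(0.25)       # lower edge of the box
median = scores.quantile(0.50)   # line inside the box
q3 = scores.quantile(0.75)       # upper edge of the box
iqr = q3 - q1                    # interquartile range (height of the box)

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(median, iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

Running the same steps on df_raw['gre'] and df_raw['gpa'] would give exact quartiles instead of values read off the plot.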
In [63]:
# distribution plot of 'admit' variable with mean
df_raw.admit.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4))
plt.legend(loc='best')
# Plot black line at mean
plt.vlines(df_raw.admit.mean(), ymin=0, ymax=2.0, linewidth=4.0)
Out[63]:
In [64]:
# distribution plot of 'gre' variable with mean
df_raw.gre.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4))
plt.legend(loc='best')
# Plot black line at mean
plt.vlines(df_raw.gre.mean(), ymin=0, ymax=0.0035, linewidth=4.0)
Out[64]:
In [65]:
# distribution plot of 'gpa' variable with mean
df_raw.gpa.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4))
plt.legend(loc='best')
# Plot black line at mean
plt.vlines(df_raw.gpa.mean(), ymin=0, ymax=1.0, linewidth=4.0)
Out[65]:
In [66]:
# distribution plot of 'prestige' variable with mean
df_raw.prestige.plot(kind = 'density', sharex = False, sharey = False, figsize = (10,4))
plt.legend(loc='best')
# Plot black line at mean
plt.vlines(df_raw.prestige.mean(), ymin=0, ymax=0.6, linewidth=4.0)
Out[66]:
Answer: We would not meet that requirement, as only the variable 'gre' is approximately normally distributed. The variables 'admit', 'gpa', and 'prestige' are not normally distributed.
Answer: Yes, this data does need to be corrected. If we are to compare these variables through linear regression or other inferential statistical tools, the variables should be rescaled to a common range. Note that rescaling changes only the scale of a variable, not the shape of its distribution; a skewed variable would need a transformation (for example a log transform) to look more normal. We can rescale by creating a new dataframe like so: df_norm = (df_raw - df_raw.mean()) / (df_raw.max() - df_raw.min())
Sourced solution for normalization: http://stackoverflow.com/questions/12525722/normalize-data-in-pandas
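To see what this normalization does in practice, here is a minimal sketch on a small made-up frame. The name `toy` and its values are hypothetical; the formula is the one quoted in the answer, applied column by column.

```python
import pandas as pd

# Hypothetical values on two very different scales
toy = pd.DataFrame({
    'gre': [400.0, 500.0, 600.0, 900.0],
    'gpa': [2.0, 3.0, 3.5, 3.5],
})

# Mean normalization: subtract each column's mean, divide by its range
toy_norm = (toy - toy.mean()) / (toy.max() - toy.min())

print(toy_norm)
print(toy_norm.mean())                       # each column is now centered near 0
print(toy_norm.max() - toy_norm.min())       # each column now spans a range of 1
print(toy['gre'].skew(), toy_norm['gre'].skew())  # skewness is unchanged
```

The last line illustrates that the rescaled column keeps the skewness of the original, so this step makes the columns comparable in scale but does not by itself make them more normal.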
In [67]:
# correlation matrix for variables in df_raw
df_raw.corr()
Out[67]:
Answer: The most interesting correlation is the one between the variables 'admit' and 'prestige'. The two variables are negatively correlated (-0.241), a weak to moderate linear association: higher values of 'prestige' tend to go with a lower rate of admission to UCLA. A correlation coefficient is not a slope, so it does not tell us by how much the likelihood of admission changes for a one-unit change in 'prestige'; estimating that would require a regression model. The 'gpa' and 'gre' variables are positively correlated (0.382408), meaning that students with higher GPAs tend to have higher GRE scores.
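The difference between a correlation coefficient and a regression slope can be made concrete. The sketch below uses made-up arrays `x` and `y` (hypothetical, not from admissions.csv) and shows that rescaling one variable changes the slope but leaves the correlation untouched.

```python
import numpy as np

# Hypothetical data: y rises by roughly 10 for each unit of x, plus some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10.0 * x + np.array([1.0, -1.0, 0.0, 1.0, -1.0])

r = np.corrcoef(x, y)[0, 1]               # Pearson correlation, unitless, in [-1, 1]
slope, intercept = np.polyfit(x, y, 1)    # least-squares line: y ~ slope * x + intercept

# Express y in different units: the slope shrinks, the correlation does not move
y_rescaled = y / 100.0
r_rescaled = np.corrcoef(x, y_rescaled)[0, 1]
slope_rescaled = np.polyfit(x, y_rescaled, 1)[0]

print(r, slope)
print(r_rescaled, slope_rescaled)
```

The correlation measures how tightly the points follow a line, while the slope measures how steep that line is in the variables' own units.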
Answer: I will examine the relationship between the variables 'prestige' and 'admit' in the admissions.csv dataset in order to determine whether the two variables are correlated and whether they may be causally linked. Further, I will determine whether this relationship is statistically significant.
Answer: H1 = There is a statistically significant relationship between undergraduate school prestige ('prestige') and admission ('admit'). H0 = There is no relationship between undergraduate school prestige ('prestige') and admission ('admit').
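One possible way to test these hypotheses is a chi-square test of independence on the cross-tabulation of the two variables. This is a sketch of that approach, not a method used elsewhere in this notebook: it assumes scipy is installed (it is not imported above), and the `toy` data are made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 10 applicants per prestige level, with made-up admit counts
toy = pd.DataFrame({
    'prestige': [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10,
    'admit': ([1] * 7 + [0] * 3 + [1] * 5 + [0] * 5 +
              [1] * 3 + [0] * 7 + [1] * 1 + [0] * 9),
})

# Contingency table: rows are admit (0/1), columns are prestige levels
table = pd.crosstab(toy['admit'], toy['prestige'])

# Test H0: 'admit' and 'prestige' are independent
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05                      # conventional significance level (an assumption)
reject_h0 = p_value < alpha

print(table)
print(chi2, p_value, dof)
print(reject_h0)
```

With the real data, the same steps would apply to df_raw after handling the missing values, since rows with NaN in either column do not enter the cross-tabulation.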
In [75]:
#used this stackoverflow.com resource to attempt to impute missing data
#(http://stackoverflow.com/questions/21050426/pandas-impute-nans)
#data imputation for variable 'admit'
#the first, commented-out line of code will not run: 'keys' is not a column in df_raw,
#so df_raw.groupby('keys') raises a KeyError; a real column name is needed there
#df_raw['admit'].fillna(df_raw.groupby('keys')['admit'].transform('mean'), inplace = True)
#assign the result back rather than calling fillna(inplace = True) on a column selection
df_raw['admit'] = df_raw['admit'].fillna(df_raw['admit'].mean())
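The group-wise imputation that the commented-out line was aiming for can be sketched as follows. The frame `toy` and its values are hypothetical, and 'prestige' is used as the grouping column purely as an example. The sketch fills 'gpa' rather than 'admit', since a group mean is easier to interpret for a continuous variable.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing gpa in each prestige group
toy = pd.DataFrame({
    'prestige': [1, 1, 1, 2, 2, 2],
    'gpa': [3.8, np.nan, 3.6, 3.0, 3.2, np.nan],
})

# Mean gpa of each row's own prestige group, aligned to the original index
group_means = toy.groupby('prestige')['gpa'].transform('mean')

# Fill each missing gpa with the mean of its group and assign the result back
toy['gpa'] = toy['gpa'].fillna(group_means)

print(toy)
```

The key point is that groupby needs the name of a column that exists in the frame, and transform('mean') returns a Series of the same length as the frame so it can be passed straight to fillna.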