In [1]:
#imports
from __future__ import division
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pylab as pl
%matplotlib inline
import seaborn as sns
In [2]:
#Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())
In [3]:
df_raw.count()
Out[3]:
In [4]:
df_raw.count().sum()
Out[4]:
Answer: df_raw.count().sum() totals the non-null cells across all columns (1595), not the number of observations. The data set contains 400 observations (rows), a few of which have missing values.
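A tiny illustration of the difference, on a toy frame (not the admissions data):

```python
import numpy as np
import pandas as pd

# Toy frame: 3 rows, one missing value
df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [4, 5, 6]})

n_rows = len(df)            # number of observations (rows)
n_cells = df.count().sum()  # number of non-null cells across all columns
print(n_rows, n_cells)      # 3 5
```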
In [47]:
summary_stats_admissions = df_raw.describe()
summary_stats_admissions
Out[47]:
In [154]:
# Compute quantiles of gre
gre_quantiles = pd.qcut(df_raw['gre'], 4)
gre_quantiles.value_counts().sort_index()
Out[154]:
In [155]:
# Compute quantiles of gpa
gpa_quantiles = pd.qcut(df_raw['gpa'], 4)
gpa_quantiles.value_counts().sort_index()
Out[155]:
In [156]:
# What is the sample size distribution among quantiles of gre and gpa by prestige level?
df_raw.pivot_table(['gre'], ['admit', gre_quantiles], [gpa_quantiles, 'prestige'], aggfunc=[len])
Out[156]:
In [157]:
# What is the standard deviation distribution among quantiles of gre and gpa by prestige level?
df_raw.pivot_table(['gre'], ['admit', gre_quantiles], [gpa_quantiles, 'prestige'], aggfunc=[np.std])
Out[157]:
In [9]:
# Inspect gre, gpa std
df_raw.std()[['gre', 'gpa']]
Out[9]:
Answer: The GRE consists of three parts (quantitative reasoning, verbal reasoning, and analytical writing), and test takers come from many different academic backgrounds, so there will be larger variation across these tested skills. Specifically, I would expect quantitative reasoning, verbal reasoning, and analytical writing skills to vary by academic institution, college, department, degree program, and specialization, because institutions and their programs place varied emphasis on developing these different skills.
e.g., Theatre arts majors might not need a strong background in quantitative reasoning, so they may not take classes focused on it (unless they attended a strong engineering school). Similarly, a computer engineering major may not need a strong background in English literature, so the emphasis is not on building analytical writing skills (unless they attended a strong liberal arts college). Therefore, I would expect more variation in GRE scores because the many different degree programs emphasize different skills, in varying amounts.
In [10]:
# Which columns have missing data?
df_raw.isnull().sum()
Out[10]:
In [11]:
# Which records are null?
df_raw[df_raw.isnull().any(axis=1)]
Out[11]:
In [12]:
# What is shape of dataframe before dropping records?
shape_before_dropna = df_raw.shape
print(shape_before_dropna)
In [13]:
# Inspect shape before dropping missing values
shape_after_dropna = df_raw.dropna(how='any').shape
print(shape_after_dropna)
In [14]:
# Now, drop missing values
df_raw.dropna(how='any', inplace=True)
Answer: Before dropping missing values the dataframe shape was (400, 4); after dropping them it was (397, 4). The isnull() check showed three records with at least one missing value in the row (axis=1).
In [178]:
#boxplot 1
#df_raw.boxplot('gre')
sns.boxplot(x='gre', data=df_raw)
plt.title('GRE: Box and Whiskers Plot')
Out[178]:
In [179]:
#boxplot 2
#df_raw.boxplot('gpa')
sns.boxplot(x='gpa', data=df_raw)
plt.title('GPA: Box and Whiskers Plot')
Out[179]:
Answer: They show the data's spread, or how far from the center the data tend to range. Specifically, boxplots show the middle fifty percent of the data and its range.
The idea is to divide the data into four equal groups and see how far apart the extreme groups are. The data is first divided into equal low and high halves at the median, which is called the second quartile, or Q2. The median of the low half is the first quartile, Q1, and the median of the high half is the third quartile, Q3. The box's ends are Q1 and Q3, and its midline is Q2, the median of the data. The interquartile range (IQR) is the distance between the box's ends, Q3 - Q1. These plots are especially good for showing differences between the low and high groups, as well as outliers.
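A minimal sketch of the quartile arithmetic behind a boxplot, on a made-up sample (1.5 is the usual Tukey fence multiplier):

```python
import numpy as np

data = np.array([2.0, 3.1, 3.4, 3.5, 3.6, 3.8, 4.0])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # box ends and midline
iqr = q3 - q1                                   # length of the box
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = data[(data < low_fence) | (data > high_fence)]
print(round(q1, 2), round(q2, 2), round(q3, 2), outliers)  # 3.25 3.5 3.7 [2.]
```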
In [20]:
# plot the distribution of each variable
df_raw.plot(kind='density', subplots=True, layout=(2, 2), sharex=False)
plt.show()
The admit distribution is bimodal (two modes, 0 and 1), as expected. The GRE and GPA distributions are approximately symmetrical. The prestige distribution is multimodal (four modes, 1 through 4), as expected.
In [121]:
# Test for normality using the Kolmogorov-Smirnov Test
# GRE normal?
print('GRE: ', stats.kstest(df_raw.gre, 'norm'))
print('Kurtosis: ', df_raw.gre.kurt())
print('Skew: ', df_raw.gre.skew())
print('~~~~~~~~~~~')
# GPA normal?
print('GPA : ', stats.kstest(df_raw.gpa, 'norm'))
print('Kurtosis: ', df_raw.gpa.kurt())
print('Skew: ', df_raw.gpa.skew())
print('~~~~~~~~~~~')
# Admit normal?
print('Admit: ', stats.kstest(df_raw.admit, 'norm'))
print('Kurtosis: ', df_raw.admit.kurt())
print('Skew: ', df_raw.admit.skew())
print('~~~~~~~~~~~')
# Prestige normal?
print('Prestige: ', stats.kstest(df_raw.prestige, 'norm'))
print('Kurtosis: ', df_raw.prestige.kurt())
print('Skew: ', df_raw.prestige.skew())
Answer: No, we would not meet that requirement. One caveat: stats.kstest(x, 'norm') compares the sample against the standard normal N(0, 1), so the extreme test statistics for GRE, GPA, admit, and prestige (1.0, 0.9897, 0.5, and 0.8413) largely reflect the variables' raw scales; the variables should be standardized (or a loc and scale passed to the test) before drawing conclusions about shape. Even allowing for that, admit and prestige are discrete, so we reject at the 95% confidence level the hypothesis that the data were drawn from a normal distribution and conclude that the data are not normally distributed.
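A sketch of how much the raw scale alone drives the KS statistic, on synthetic normal data (the GRE-like mean and standard deviation here are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=580, scale=115, size=400)  # GRE-like scale, truly normal

# Raw scale: compared against N(0, 1), so D is near 1 regardless of shape
d_raw = stats.kstest(x, 'norm').statistic

# Standardize first, then test
z = (x - x.mean()) / x.std(ddof=1)
d_std = stats.kstest(z, 'norm').statistic
print(round(d_raw, 3), round(d_std, 3))
```

One further caveat: when the mean and standard deviation are estimated from the same sample, the standard KS p-value is inflated; Lilliefors' correction accounts for this.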
Answer: Yes, it needs correction, because the distributions are not normal: they are both left-skewed and leptokurtic. I plan to remove outliers and log-transform the data.
In [49]:
# GRE IQR
q3_gre = summary_stats_admissions.gre['75%']
q1_gre = summary_stats_admissions.gre['25%']
iqr_gre = q3_gre - q1_gre
low_fence_gre = q1_gre - 1.5*iqr_gre
high_fence_gre = q3_gre + 1.5*iqr_gre
print("GRE IQR: ", iqr_gre)
print("GRE low fence: ", low_fence_gre)
print("GRE high fence: ", high_fence_gre)
In [126]:
# Find GRE outliers
print('Number of outliers: ', df_raw[(df_raw.gre < low_fence_gre) | (df_raw.gre > high_fence_gre)].shape[0])
print('These are the outliers: ')
df_raw[(df_raw.gre < low_fence_gre) | (df_raw.gre > high_fence_gre)]
Out[126]:
In [127]:
# Remove GRE outliers
print('Shape before outlier removal is: ', df_raw.shape)
df = df_raw[(df_raw.gre >= low_fence_gre) & (df_raw.gre <= high_fence_gre)]
print('Shape after outlier removal is: ', df.shape)
In [140]:
# Plot to visually inspect distribution, still looks skewed
df.gre.plot.density()
plt.title('GRE density')
plt.show()
In [50]:
# GPA IQR
q3_gpa = summary_stats_admissions.gpa['75%']
q1_gpa = summary_stats_admissions.gpa['25%']
iqr_gpa = q3_gpa - q1_gpa
low_fence_gpa = q1_gpa - 1.5*iqr_gpa
high_fence_gpa = q3_gpa + 1.5*iqr_gpa
print("GPA IQR: ", round(iqr_gpa, 1))
print("GPA low fence: ", round(low_fence_gpa, 1))
print("GPA high fence: ", round(high_fence_gpa, 1))
In [129]:
# Now, find GPA Outliers
print('Number of outliers: ', df[(df.gpa < low_fence_gpa) | (df.gpa > high_fence_gpa)].shape[0])
print('These are the outliers: ')
df[(df.gpa < low_fence_gpa) | (df.gpa > high_fence_gpa)]
Out[129]:
In [130]:
print('Shape before outlier removal is: ', df.shape)
df = df[(df.gpa >= low_fence_gpa) & (df.gpa <= high_fence_gpa)]
print('Shape after outlier removal is: ', df.shape)
In [142]:
# Plot to visually inspect distribution, still looks skewed!
df.gpa.plot.density()
plt.title('GPA density')
plt.show()
In [186]:
# Removed outliers: re-test for normality using the Kolmogorov-Smirnov Test
# Observation: skew got better, kurtosis got worse!
# GRE
print('GRE: ', stats.kstest(df.gre, 'norm'))
print('Kurtosis: ', df.gre.kurt())
print('Skew: ', df.gre.skew())
print('~~~~~~~~~~~')
# GPA
print('GPA : ', stats.kstest(df.gpa, 'norm'))
print('Kurtosis: ', df.gpa.kurt())
print('Skew: ', df.gpa.skew())
In [168]:
# Transform GRE distribution to standard normal
sns.distplot((df.gre - df.gre.mean()) / df.gre.std(), bins=5, kde_kws={'bw': 1})
plt.title('GRE to Standard Normal')
plt.show()
In [170]:
# Transform GPA distribution to standard normal
sns.distplot((df.gpa - df.gpa.mean()) / df.gpa.std(), bins=10, kde_kws={'bw': 1})
plt.title('GPA to Standard Normal')
plt.show()
In [185]:
# Log transform the data: re-test for normality using the Kolmogorov-Smirnov Test
# Observation: Skew got worse, Kurtosis got better
# GRE
print('GRE: ', stats.kstest(np.log(df.gre), 'norm'))
print('Kurtosis: ', np.log(df.gre).kurt())
print('Skew: ', np.log(df.gre).skew())
print('~~~~~~~~~~~')
# GPA
print('GPA : ', stats.kstest(np.log(df.gpa), 'norm'))
print('Kurtosis: ', np.log(df.gpa).kurt())
print('Skew: ', np.log(df.gpa).skew())
Answer: I don't know how to fully correct for the skewness and kurtosis inherent in this data set. (A log transform compresses the right tail, so it tends to make left-skewed data more skewed, consistent with the result above; a transform that compresses the left tail instead would be a better candidate.)
But here's what I found:
Answer: GPA and GRE are potentially collinear. i.e., They are moderately positively correlated.
In [28]:
# create a correlation matrix for the data
df_raw.corr()
Out[28]:
In [29]:
sns.heatmap(df_raw.corr(), annot=True, cmap='RdBu')
Out[29]:
In [30]:
pd.plotting.scatter_matrix(df_raw)
plt.show()
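To go beyond eyeballing the correlation matrix, variance inflation factors quantify how much each predictor is explained by the others (a VIF above roughly 5 is a common collinearity flag). A sketch on synthetic data standing in for the admissions predictors (the values are assumptions, not the real data):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the predictors
rng = np.random.default_rng(1)
gre = rng.normal(580.0, 115.0, 200)
gpa = 0.004 * gre + rng.normal(0.0, 0.3, 200)  # moderately correlated with gre

X = pd.DataFrame({'const': 1.0, 'gre': gre, 'gpa': gpa})
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'const'}
print(vifs)
```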
Answer: First, inspect the spread of the data both graphically and in tabular form. Look at the counts in each of the factor-level combinations to see how each is represented. Generate summary statistics for each predictor to quality-check the count, variance, standard deviation, quartiles, and number of null values. Identify any null values and remove them or impute replacements. Create boxplots, histograms, and density plots to inspect the shape of the distributions. Check for outliers, identify which data instances are outliers, and remove them. Check the distributions for normality with the Kolmogorov-Smirnov test, and at the same time evaluate their skew and kurtosis. If necessary, log-transform the data to move a non-normal distribution toward normality (this won't always work, but it's worth trying!). Finally, rescale the data to the standard normal if necessary and graph it to observe how well it fits the standard normal distribution.
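The null-handling and outlier-fence steps above can be wrapped into one helper (a sketch; the function name and demo values are made up, and 1.5 is the conventional fence multiplier):

```python
import numpy as np
import pandas as pd

def drop_tukey_outliers(df, col, k=1.5):
    """Drop rows where `col` is null or lies beyond the Tukey fences
    Q1 - k*IQR and Q3 + k*IQR."""
    s = df[col].dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[col].between(q1 - k * iqr, q3 + k * iqr)  # NaN rows -> False
    return df[mask]

demo = pd.DataFrame({'gre': [520.0, 600.0, 640.0, 580.0, 560.0, 220.0, np.nan]})
cleaned = drop_tukey_outliers(demo, 'gre')
print(cleaned.shape)  # (5, 1) -- the 220.0 outlier and the NaN row are gone
```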
Answer: We hypothesize that GRE, GPA, and prestige may be used to predict if a student will be admitted to graduate school.
Mark boxes with an 'X'
| Requirements | Incomplete (0) | Does Not Meet Expectations (1) | Meets Expectations (2) | Exceeds Expectations (3) |
|---|---|---|---|---|
| Read in your dataset, determine how many samples are present, and ID any missing data | X | |||
| Create a table of descriptive statistics for each of the variables (n, mean, median, standard deviation) | X | |||
| Describe the distributions of your data | X | |||
| Plot box plots for each variable | X | |||
| Create a covariance matrix | X | |||
| Determine any issues or limitations, based on your exploratory analysis | X | |||
| Outline exploratory analysis methods | X |
Notes:
Overall, excellent work. I'm very impressed with your knowledge of statistics and your application in this context. Remember to keep your graphs organized and concise - don't include more than you need or the audience can get overwhelmed. Also always include an explanation with your visuals. Excellent job, you're very much on the right track.
Based on the requirements, you can earn a maximum of 21 points on this project.
In [ ]: