Student Admissions

Predicting student admissions to graduate school at UCLA based on GRE score, GPA, and class rank

Supervised Learning. Classification

Dataset from http://www.ats.ucla.edu/

Based on the Predicting Student Admissions mini-project from Udacity's Artificial Intelligence Nanodegree


In [1]:
%matplotlib inline

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import keras
import helper

helper.reproducible(seed=0)
sns.set()


Using TensorFlow backend.

Load and prepare the data


In [2]:
data_path = 'data/student_admissions.csv'
df = pd.read_csv(data_path)
df.head()


Out[2]:
admit gre gpa rank
0 0 380.0 3.61 3.0
1 1 660.0 3.67 3.0
2 1 800.0 4.00 1.0
3 1 640.0 3.19 4.0
4 0 520.0 2.93 4.0

In [3]:
df.describe()


Out[3]:
admit gre gpa rank
count 400.000000 398.000000 398.00000 399.000000
mean 0.317500 588.040201 3.39093 2.486216
std 0.466087 115.628513 0.38063 0.945333
min 0.000000 220.000000 2.26000 1.000000
25% 0.000000 520.000000 3.13000 2.000000
50% 0.000000 580.000000 3.39500 2.000000
75% 1.000000 660.000000 3.67000 3.000000
max 1.000000 800.000000 4.00000 4.000000

In [4]:
targets = ['admit']
features = ['gre', 'gpa', 'rank']

categorical = ['admit', 'rank']
numerical = ['gre', 'gpa']

# NaN values
df.fillna(df[numerical].median(), inplace=True)  # NaN in a numerical feature: replace with the column median
df.dropna(axis='index', how='any', inplace=True)  # NaN in a categorical feature: drop the row

df_visualize = df.copy()  # keep an unscaled copy for model visualization later
df.shape


Out[4]:
(399, 4)

Visualize data


In [5]:
def plot_data(dataf, hue="admit"):
    """ Custom plot for this project """
    g = sns.FacetGrid(dataf, col="rank",  hue=hue)
    g = (g.map(plt.scatter, "gre", "gpa", edgecolor="w").add_legend())
    return g
    
plot_data(df)


Out[5]:
<seaborn.axisgrid.FacetGrid at 0x7fdcba2097b8>

Create dummy variables


In [6]:
dummies = pd.get_dummies(df['rank'], prefix='rank', drop_first=False)
df = pd.concat([df, dummies], axis=1)
df = df.drop('rank', axis='columns')
df.head()


Out[6]:
admit gre gpa rank_1.0 rank_2.0 rank_3.0 rank_4.0
0 0 380.0 3.61 0 0 1 0
1 1 660.0 3.67 0 0 1 0
2 1 800.0 4.00 1 0 0 0
3 1 640.0 3.19 0 0 0 1
4 0 520.0 2.93 0 0 0 1

Scale numerical features


In [7]:
# Store scalings in a dictionary so we can convert back later
scaled_features = {}
for f in numerical:
    mean, std = df[f].mean(), df[f].std()
    scaled_features[f] = [mean, std]
    df.loc[:, f] = (df[f] - mean)/std
df.head()


Out[7]:
admit gre gpa rank_1.0 rank_2.0 rank_3.0 rank_4.0
0 0 -1.800426 0.576244 0 0 1 0
1 1 0.625329 0.734075 0 0 1 0
2 1 1.838206 1.602149 1 0 0 0
3 1 0.452061 -0.528578 0 0 0 1
4 0 -0.587548 -1.212515 0 0 0 1
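The (mean, std) pairs stored in scaled_features make it possible to map scaled values back to the original units later on. A minimal sketch, reusing the scaled_features dictionary and df defined above (unscale is a hypothetical helper, not part of the original notebook):

def unscale(dataf, feature):
    """Undo the standardization of one column using the stored (mean, std)."""
    mean, std = scaled_features[feature]
    return dataf[feature] * std + mean

# e.g. unscale(df, 'gre') recovers the original GRE scores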

Split the data into training and test sets


In [8]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=9)

# Separate the data into features and targets (x=features, y=targets)
x_train, y_train = train.drop(targets, axis=1).values, train[targets].values
x_test, y_test = test.drop(targets, axis=1).values, test[targets].values

One-hot encoding the target


In [9]:
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print("Training set: \t x-shape = {} \t y-shape = {}".format(x_train.shape ,y_train.shape))
print("Test set: \t x-shape = {} \t y-shape = {}".format(x_test.shape ,y_test.shape))


Training set: 	 x-shape = (319, 6) 	 y-shape = (319, 2)
Test set: 	 x-shape = (80, 6) 	 y-shape = (80, 2)

Deep Neural Network


In [10]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout

input_nodes = x_train.shape[1]*8
weights = keras.initializers.RandomNormal(stddev=0.1)

model = Sequential()
model.add(Dense(input_nodes, input_dim=x_train.shape[1], activation='relu'))
model.add(Dropout(.2))
model.add(Dense(2,activation='softmax'))
model.summary()

model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('\nTraining ....')
callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=0)]
%time history = model.fit(x_train, y_train, epochs=1000, batch_size=64, verbose=0, \
                          validation_split=0.25, callbacks=callbacks)
helper.show_training(history)

model_path = os.path.join("models", "student_admissions.h5")
model.save(model_path)
print("\nModel saved at",model_path)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 48)                336       
_________________________________________________________________
dropout_1 (Dropout)          (None, 48)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 98        
=================================================================
Total params: 434
Trainable params: 434
Non-trainable params: 0
_________________________________________________________________

Training ....
CPU times: user 1.19 s, sys: 120 ms, total: 1.31 s
Wall time: 1.15 s
Training loss:  	0.5880
Validation loss: 	0.5287

Training accuracy: 	0.699
Validation accuracy:	0.762

Model saved at models/student_admissions.h5

Evaluate the model


In [11]:
model = keras.models.load_model(model_path)
print("Model loaded:", model_path)

score = model.evaluate(x_test, y_test, verbose=0)
print("\nTest Accuracy: {:.2f}".format(score[1]))


Model loaded: models/student_admissions.h5

Test Accuracy: 0.75

Visualize the model


In [12]:
predictions = model.predict(df.drop(targets, axis=1).values)
predictions = np.argmax(predictions, axis=1)

df_visualize['predicted'] = predictions 

plot_data(df_visualize).fig.suptitle('Actual')
plot_data(df_visualize, hue='predicted').fig.suptitle('Model')


Out[12]:
<matplotlib.text.Text at 0x7fdc39c30be0>

Some overfitting can be observed, particularly for rank-2 students. More information can be extracted by looking at the predicted probabilities instead of the binary accepted/rejected result shown here, as sketched below.
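For example, the predicted class-1 probability can be attached to the visualization frame and used as the hue instead of the hard prediction. A sketch, reusing model, df, df_visualize, targets and plot_data from the cells above ('p_admit' is a hypothetical column name, not part of the original notebook):

# Predicted probability of admission (class 1) for every student
probabilities = model.predict(df.drop(targets, axis=1).values)[:, 1]

# Round to one decimal so the FacetGrid hue has a manageable number of levels
df_visualize['p_admit'] = probabilities.round(1)
plot_data(df_visualize, hue='p_admit')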

Make predictions


In [13]:
def predict_admission(student):
    # student: {name: [gre, gpa, rank_1, rank_2, rank_3, rank_4]}

    print('Admission Probabilities: \n')

    for key, value in student.items():
        p_name = key
        single_data = value

        # normalize data
        for idx, f in enumerate(numerical):
            single_data[idx] = (single_data[idx] - scaled_features[f][0]) / scaled_features[f][1]

        # make prediction
        single_pred = model.predict(np.array([single_data]))
        print('{}: \t {:.0f}%\n'.format(p_name, single_pred[0, 1] * 100))


df_visualize.describe()


Out[13]:
admit gre gpa rank predicted
count 399.000000 399.000000 399.000000 399.000000 399.000000
mean 0.315789 587.819549 3.390940 2.486216 0.172932
std 0.465413 115.428011 0.380152 0.945333 0.378664
min 0.000000 220.000000 2.260000 1.000000 0.000000
25% 0.000000 520.000000 3.130000 2.000000 0.000000
50% 0.000000 580.000000 3.395000 2.000000 0.000000
75% 1.000000 660.000000 3.670000 3.000000 0.000000
max 1.000000 800.000000 4.000000 4.000000 1.000000

In [14]:
# new_students: {name: [gre, gpa, rank_1, rank_2, rank_3, rank_4]}
new_students =  {'High scores rank-1':  [730, 3.83, 1, 0, 0, 0],
                 'High scores rank-2':  [730, 3.83, 0, 1, 0, 0],
                 'High scores rank-3':  [730, 3.83, 0, 0, 1, 0],
                 'High scores rank-4':  [730, 3.83, 0, 0, 0, 1],   
                 'Avg scores rank-1':  [588, 3.4, 1, 0, 0, 0],
                 'Avg scores rank-2':  [588, 3.4, 0, 1, 0, 0],
                 'Avg scores rank-3':  [588, 3.4, 0, 0, 1, 0],
                 'Avg scores rank-4':  [588, 3.4, 0, 0, 0, 1],
                 }
predict_admission(new_students)


Admission Probabilities: 

High scores rank-1: 	 65%

High scores rank-2: 	 53%

High scores rank-3: 	 35%

High scores rank-4: 	 35%

Avg scores rank-1: 	 49%

Avg scores rank-2: 	 40%

Avg scores rank-3: 	 28%

Avg scores rank-4: 	 26%

The predictions confirm that rank is the most influential feature in determining admission, which seems reasonable. The absolute scores of the students matter most for rank-1 applicants (Q1).
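A quick way to quantify the influence of rank is to compare, per rank, the actual admission rate with the model's mean predicted probability over the whole dataset. A sketch, reusing model, df, targets and df_visualize from above ('p_admit' is the same hypothetical column as in the probability sketch):

# Mean actual admission rate vs. mean predicted admission probability, grouped by rank
df_visualize['p_admit'] = model.predict(df.drop(targets, axis=1).values)[:, 1]
df_visualize.groupby('rank')[['admit', 'p_admit']].mean()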