In this notebook, we predict graduate admissions at UCLA from three pieces of data:
GRE Scores (Test)
GPA Scores (Grades)
Class rank (1-4)
Dataset source: http://www.ats.ucla.edu/
To load the data and format it nicely, we will use two very useful packages, Pandas and Numpy. You can read their documentation here:
In [ ]:
# Importing pandas and numpy
import pandas as pd
import numpy as np
# Reading the csv file into a pandas DataFrame
data = pd.read_csv('student_data.csv')
# Printing out the first 10 rows of our data
data[:10]
In [ ]:
# Importing matplotlib
import matplotlib.pyplot as plt
# Function to help us plot
def plot_points(data):
    X = np.array(data[["gre","gpa"]])
    y = np.array(data["admit"])
    # np.argwhere returns an (n, 1) index array, so each selected point has shape (1, 2)
    admitted = X[np.argwhere(y==1)]
    rejected = X[np.argwhere(y==0)]
    plt.scatter([s[0][0] for s in rejected], [s[0][1] for s in rejected], s = 25, color = 'red', edgecolor = 'k')
    plt.scatter([s[0][0] for s in admitted], [s[0][1] for s in admitted], s = 25, color = 'cyan', edgecolor = 'k')
    plt.xlabel('Test (GRE)')
    plt.ylabel('Grades (GPA)')
# Plotting the points
plot_points(data)
plt.show()
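Roughly, it looks like students with high grades and test scores were admitted, while those with low scores were not, but the data is not as cleanly separated as we would like. Maybe taking the rank into account will help? Next, we make 4 plots, one for each rank.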
In [ ]:
# Separating the ranks
data_rank1 = data[data["rank"]==1]
data_rank2 = data[data["rank"]==2]
data_rank3 = data[data["rank"]==3]
data_rank4 = data[data["rank"]==4]
# Plotting the graphs
plot_points(data_rank1)
plt.title("Rank 1")
plt.show()
plot_points(data_rank2)
plt.title("Rank 2")
plt.show()
plot_points(data_rank3)
plt.title("Rank 3")
plt.show()
plot_points(data_rank4)
plt.title("Rank 4")
plt.show()
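The per-rank plots can be backed up with a number: the fraction of admitted students in each rank. The following is a minimal sketch, assuming only the first ten rows of the dataset as they are printed later in this notebook; the names `first_rows` and `admit_rate_by_rank` are illustrative and not part of the original code. Ten rows are far too few to draw conclusions from, so this only demonstrates the call.

```python
import pandas as pd

# First ten rows of the dataset, copied from the tables printed in this notebook
first_rows = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1, 1, 0, 1, 0],
    "gre": [380, 660, 800, 640, 520, 760, 560, 400, 540, 700],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3.92],
    "rank": [3, 3, 1, 4, 4, 2, 1, 2, 3, 2],
})

# Because admit is 0 or 1, its mean within a group is that group's admission rate
admit_rate_by_rank = first_rows.groupby("rank")["admit"].mean()
print(admit_rate_by_rank)
```

On the full `data` DataFrame, the same `groupby("rank")["admit"].mean()` call should apply unchanged.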
In [ ]:
# Make dummy variables for rank
one_hot_data = pd.concat([data, pd.get_dummies(data['rank'], prefix='rank')], axis=1)
# Drop the previous rank column
one_hot_data = one_hot_data.drop('rank', axis=1)
# Print the first 10 rows of our data
one_hot_data[:10]
 | admit | gre | gpa | rank_1 | rank_2 | rank_3 | rank_4 |
---|---|---|---|---|---|---|---|
0 | 0 | 380 | 3.61 | 0 | 0 | 1 | 0 |
1 | 1 | 660 | 3.67 | 0 | 0 | 1 | 0 |
2 | 1 | 800 | 4.00 | 1 | 0 | 0 | 0 |
3 | 1 | 640 | 3.19 | 0 | 0 | 0 | 1 |
4 | 0 | 520 | 2.93 | 0 | 0 | 0 | 1 |
5 | 1 | 760 | 3.00 | 0 | 1 | 0 | 0 |
6 | 1 | 560 | 2.98 | 1 | 0 | 0 | 0 |
7 | 0 | 400 | 3.08 | 0 | 1 | 0 | 0 |
8 | 1 | 540 | 3.39 | 0 | 0 | 1 | 0 |
9 | 0 | 700 | 3.92 | 0 | 1 | 0 | 0 |
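The cell above turns the single `rank` column into four indicator columns, exactly one of which is 1 in each row. A small sketch of that step in isolation, assuming only the rank values of the ten rows shown above (the names `ranks` and `dummies` are illustrative):

```python
import pandas as pd

# Rank values of the first ten rows shown in the table above
ranks = pd.DataFrame({"rank": [3, 3, 1, 4, 4, 2, 1, 2, 3, 2]})

# dtype=int requests 0/1 columns; pandas 2.0 and later return booleans by default
dummies = pd.get_dummies(ranks["rank"], prefix="rank", dtype=int)

print(dummies.head())
```

Passing `dtype=int` is a choice made here so that the output matches the 0/1 table above on recent pandas versions; the original cell does not pass it.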
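The next step is to scale the data. Grades range over 1.0-4.0, while test scores range roughly over 200-800, which is a much larger range. Features on such different scales make the data hard for a neural network to handle. Let's fit both features into the 0-1 range by dividing the grades by 4.0 and the test scores by 800.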
In [ ]:
# Copying our data (.copy() makes an independent DataFrame rather than a slice)
processed_data = one_hot_data.copy()
# Scaling the columns
processed_data['gre'] = processed_data['gre']/800
processed_data['gpa'] = processed_data['gpa']/4.0
processed_data[:10]
 | admit | gre | gpa | rank_1 | rank_2 | rank_3 | rank_4 |
---|---|---|---|---|---|---|---|
0 | 0 | 0.475 | 0.9025 | 0 | 0 | 1 | 0 |
1 | 1 | 0.825 | 0.9175 | 0 | 0 | 1 | 0 |
2 | 1 | 1.000 | 1.0000 | 1 | 0 | 0 | 0 |
3 | 1 | 0.800 | 0.7975 | 0 | 0 | 0 | 1 |
4 | 0 | 0.650 | 0.7325 | 0 | 0 | 0 | 1 |
5 | 1 | 0.950 | 0.7500 | 0 | 1 | 0 | 0 |
6 | 1 | 0.700 | 0.7450 | 1 | 0 | 0 | 0 |
7 | 0 | 0.500 | 0.7700 | 0 | 1 | 0 | 0 |
8 | 1 | 0.675 | 0.8475 | 0 | 0 | 1 | 0 |
9 | 0 | 0.875 | 0.9800 | 0 | 1 | 0 | 0 |
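Dividing by the largest possible value, as done above, is one way to reach the 0-1 range. A common alternative is min-max scaling, which uses the smallest and largest values actually observed in the column. The sketch below shows that alternative technique, not the notebook's method; `min_max_scale` is a hypothetical helper, and the input is the GRE column of the ten rows shown earlier.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column linearly so its minimum maps to 0 and its maximum to 1."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column has no spread to rescale; map it to zeros
        return column * 0.0
    return (column - column.min()) / span

# GRE scores of the first ten rows of the dataset
gre = pd.Series([380, 660, 800, 640, 520, 760, 560, 400, 540, 700], name="gre")
scaled_gre = min_max_scale(gre)
print(scaled_gre)
```

Unlike dividing by 800, this variant depends on the data it is fitted on, so the minimum and maximum would have to be taken from the training set and reused for the testing set.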
To test our algorithm, we split the data into a training set and a testing set. The testing set will be 10% of the total data.
In [ ]:
sample = np.random.choice(processed_data.index, size=int(len(processed_data)*0.9), replace=False)
# sample holds index labels, so select the training rows by label with .loc
train_data, test_data = processed_data.loc[sample], processed_data.drop(sample)
print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:10])
print(test_data[:10])
In [ ]:
Number of training samples is 360
Number of testing samples is 40
admit gre gpa rank_1 rank_2 rank_3 rank_4
375 0 0.700 0.8725 0 0 0 1
270 1 0.800 0.9875 0 1 0 0
87 0 0.750 0.8700 0 1 0 0
296 0 0.700 0.7900 1 0 0 0
340 0 0.625 0.8075 0 0 0 1
35 0 0.500 0.7625 0 1 0 0
293 0 1.000 0.9925 1 0 0 0
372 1 0.850 0.6050 1 0 0 0
307 0 0.725 0.8775 0 1 0 0
114 0 0.900 0.9600 0 0 1 0
admit gre gpa rank_1 rank_2 rank_3 rank_4
0 0 0.475 0.9025 0 0 1 0
6 1 0.700 0.7450 1 0 0 0
13 0 0.875 0.7700 0 1 0 0
16 0 0.975 0.9675 0 0 0 1
17 0 0.450 0.6400 0 0 1 0
22 0 0.750 0.7050 0 0 0 1
24 1 0.950 0.8375 0 1 0 0
27 1 0.650 0.9350 0 0 0 1
33 1 1.000 1.0000 0 0 1 0
39 1 0.650 0.6700 0 0 1 0
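The split above draws a different random sample on every run, so the rows printed here will differ between runs. A minimal sketch of a repeatable variant, assuming a stand-in DataFrame with 400 rows in place of `processed_data` (the names `frame`, `train_part` and `test_part` are illustrative, and the seed value is arbitrary):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for processed_data: 400 rows with a default integer index
frame = pd.DataFrame({"value": np.arange(400)})

# A seeded generator returns the same sample every time the cell is run
rng = np.random.default_rng(seed=42)
train_index = rng.choice(frame.index, size=int(len(frame) * 0.9), replace=False)

# Select the sampled labels for training and keep the remaining rows for testing
train_part = frame.loc[train_index]
test_part = frame.drop(train_index)
print(len(train_part), len(test_part))
```

Because the sample is drawn without replacement and the rest is obtained with `drop`, no row can end up in both sets.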
In [ ]:
import keras
# Separate data and one-hot encode the output
# Note: We're also turning the data into numpy arrays, in order to train the model in Keras
features = np.array(train_data.drop('admit', axis=1))
targets = np.array(keras.utils.to_categorical(train_data['admit'], 2))
features_test = np.array(test_data.drop('admit', axis=1))
targets_test = np.array(keras.utils.to_categorical(test_data['admit'], 2))
print(features[:10])
print(targets[:10])
In [ ]:
[[ 0.7 0.8725 0. 0. 0. 1. ]
[ 0.8 0.9875 0. 1. 0. 0. ]
[ 0.75 0.87 0. 1. 0. 0. ]
[ 0.7 0.79 1. 0. 0. 0. ]
[ 0.625 0.8075 0. 0. 0. 1. ]
[ 0.5 0.7625 0. 1. 0. 0. ]
[ 1. 0.9925 1. 0. 0. 0. ]
[ 0.85 0.605 1. 0. 0. 0. ]
[ 0.725 0.8775 0. 1. 0. 0. ]
[ 0.9 0.96 0. 0. 1. 0. ]]
[[ 1. 0.]
[ 0. 1.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 0. 1.]
[ 1. 0.]
[ 1. 0.]]
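Each row of the targets printed above is a one-hot vector: `[1, 0]` stands for class 0 (rejected) and `[0, 1]` for class 1 (admitted). The sketch below reproduces that encoding with plain NumPy to make the mapping explicit. It is an equivalent illustration rather than the Keras implementation, and `labels`, `one_hot` and `recovered` are illustrative names.

```python
import numpy as np

# admit values of the ten training rows printed earlier in this notebook
labels = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0])

# Row k of the 2x2 identity matrix is the one-hot vector for class k
one_hot = np.eye(2)[labels]

# argmax along each row recovers the original class label
recovered = one_hot.argmax(axis=1)
print(one_hot)
```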
In [ ]:
# Imports
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
# Building the model
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(6,)))
model.add(Dropout(.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(.1))
model.add(Dense(2, activation='softmax'))
# Compiling the model
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
In [ ]:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_16 (Dense) (None, 128) 896
_________________________________________________________________
dropout_11 (Dropout) (None, 128) 0
_________________________________________________________________
dense_17 (Dense) (None, 64) 8256
_________________________________________________________________
dropout_12 (Dropout) (None, 64) 0
_________________________________________________________________
dense_18 (Dense) (None, 2) 130
=================================================================
Total params: 9,282
Trainable params: 9,282
Non-trainable params: 0
_________________________________________________________________
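The parameter counts in the summary can be checked by hand. A fully connected layer has one weight per input-unit pair plus one bias per unit, and dropout layers have no parameters. A short sketch of that arithmetic, where `dense_param_count` is a hypothetical helper and not a Keras function:

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the widths of the three Dense layers in the model above
layer_sizes = [6, 128, 64, 2]

# Pair each layer's input width with its output width
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # prints [896, 8256, 130] 9282
```

These values agree with the `Param #` column and the total of 9,282 reported by `model.summary()`.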
In [ ]:
# Training the model
model.fit(features, targets, epochs=200, batch_size=100, verbose=0)
In [ ]:
<keras.callbacks.History at 0x114a34eb8>
In [ ]:
# Evaluating the model on the training and testing set
score = model.evaluate(features, targets)
print("\n Training Accuracy:", score[1])
score = model.evaluate(features_test, targets_test)
print("\n Testing Accuracy:", score[1])
In [ ]:
32/360 [=>............................] - ETA: 0s
Training Accuracy: 0.730555555556
32/40 [=======================>......] - ETA: 0s
Testing Accuracy: 0.7
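An accuracy figure is easier to judge next to the score of a trivial model that always predicts the most frequent class. The sketch below computes both from one-hot targets and predicted class probabilities. The function names `accuracy` and `majority_baseline` and the toy arrays are made up for illustration; in this notebook the probabilities would presumably come from `model.predict(features_test)`.

```python
import numpy as np

def accuracy(probabilities: np.ndarray, one_hot_targets: np.ndarray) -> float:
    """Fraction of rows where the most probable class equals the true class."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return float(np.mean(predicted == actual))

def majority_baseline(one_hot_targets: np.ndarray) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    actual = np.argmax(one_hot_targets, axis=1)
    counts = np.bincount(actual, minlength=one_hot_targets.shape[1])
    return float(counts.max() / counts.sum())

# Toy values for illustration only, not results from the model above
toy_targets = np.array([[1, 0], [1, 0], [0, 1], [1, 0]], dtype=float)
toy_probabilities = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7], [0.9, 0.1]])

print(accuracy(toy_probabilities, toy_targets))
print(majority_baseline(toy_targets))
```

If the model's testing accuracy is not clearly above `majority_baseline(targets_test)`, the network has learned little beyond the class frequencies.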