In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Use pandas to read in the csv file called anonymized_data.csv. It contains 500 rows and 30 columns of anonymized data, plus one final column holding a classification label. The feature columns have been renamed to 4-letter codes.
In [2]:
data = pd.read_csv('./data/anonymized_data.csv')
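As a quick sanity check (optional, not part of the original exercise), you can confirm the advertised dimensions: 500 rows and 31 columns (30 anonymized features plus the Label column).
data.shape  # expect (500, 31)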
In [3]:
data.head()
Out[3]:
In [4]:
data.info()
In [5]:
data.describe()
Out[5]:
In [6]:
from sklearn.preprocessing import MinMaxScaler
In [7]:
scaler = MinMaxScaler()
In [8]:
X_data = scaler.fit_transform(data.drop('Label', axis = 1))
In [9]:
pd.DataFrame(X_data, columns = data.columns[:-1]).describe()
Out[9]:
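Optionally, a minimal sanity-check sketch (not part of the original exercise, and assuming no feature column is constant): after MinMaxScaler, every feature column should span exactly the [0, 1] range.
# Each column's min should be 0 and max should be 1 after scaling.
assert np.allclose(X_data.min(axis = 0), 0.0) and np.allclose(X_data.max(axis = 0), 1.0)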
Import tensorflow and import the fully_connected layer function from tensorflow.contrib.layers. (Note: tf.contrib was removed in TensorFlow 2.x, so this notebook requires TensorFlow 1.x.)
In [10]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
Fill in the number of inputs to match the dimensionality of the data set, set the number of hidden units to 2, and set the number of outputs to match the number of inputs. Then choose a learning_rate value.
In [11]:
num_inputs = 30 # FILL ME IN
num_hidden = 2 # FILL ME IN
num_outputs = num_inputs # Must be true for an autoencoder!
learning_rate = 0.01 # FILL ME IN
In [12]:
X = tf.placeholder(tf.float32, shape = [None, num_inputs])
In [13]:
# Linear activations throughout: with an MSE loss, this linear autoencoder
# learns the same 2-dimensional subspace as PCA.
hidden_layer = fully_connected(inputs = X,
                               num_outputs = num_hidden,
                               activation_fn = None)
outputs = fully_connected(inputs = hidden_layer,
                          num_outputs = num_outputs,
                          activation_fn = None)
In [14]:
loss = tf.reduce_mean(tf.square(outputs - X)) # mean squared reconstruction error
Create an AdamOptimizer designed to minimize the previous loss function.
In [15]:
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)
In [16]:
init = tf.global_variables_initializer()
In [17]:
num_steps = 1000
with tf.Session() as sess:
    sess.run(init)
    for iteration in range(num_steps):
        sess.run(train,
                 feed_dict = {X: X_data})

    # Now ask for the hidden layer output (the 2 dimensional output)
    output_2d = hidden_layer.eval(feed_dict = {X: X_data})
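If you want to watch training converge, here is a variant of the same loop (a sketch, not part of the original solution) that prints the reconstruction loss every 100 steps:
with tf.Session() as sess:
    sess.run(init)
    for iteration in range(num_steps):
        sess.run(train, feed_dict = {X: X_data})
        if iteration % 100 == 0:
            # Evaluate the current mean squared reconstruction error.
            print('Step {}: loss = {}'.format(iteration, loss.eval(feed_dict = {X: X_data})))
    output_2d = hidden_layer.eval(feed_dict = {X: X_data})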
Confirm that your output is now 2-dimensional, reduced from the previous 30 features.
In [18]:
output_2d.shape
Out[18]:
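You should expect (500, 2) here: 500 samples, each compressed from 30 features down to the 2 hidden units. A quick check:
# Expect 500 rows and 2 columns (one per hidden unit).
assert output_2d.shape == (500, 2)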
Now plot the reduced-dimension representation of the data. Do you still have clear separation of classes even with the reduction in dimensions? Hint: you definitely should; the classes should still be clearly separable, even when reduced to 2 dimensions.
In [19]:
plt.scatter(output_2d[:, 0],
            output_2d[:, 1],
            c = data['Label'])
Out[19]:
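For a slightly more readable figure, you could label the axes and add a colorbar keyed to the class label (optional polish, a sketch rather than part of the original solution):
plt.scatter(output_2d[:, 0],
            output_2d[:, 1],
            c = data['Label'])
plt.xlabel('Hidden unit 1')
plt.ylabel('Hidden unit 2')
plt.colorbar(label = 'Label')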