Linear Autoencoder for PCA - EXERCISE

Follow the bold instructions below to reduce a 30-dimensional data set for classification down to a 2-dimensional dataset. Then color the points by class to see whether the same level of class separation is preserved after the dimensionality reduction.

The Data

Import numpy, matplotlib, and pandas


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Use pandas to read in the csv file called anonymized_data.csv. It contains 500 rows of anonymized data: 30 feature columns (renamed to 4-letter codes) plus one final column with a classification label.


In [2]:
data = pd.read_csv('./data/anonymized_data.csv')

In [3]:
data.head()


Out[3]:
EJWY VALM EGXO HTGR SKRF NNSZ NYLC GWID TVUT CJHI ... LKKS UOBF VBHE FRWU NDYZ QSBO JDUB TEVK EZTM Label
0 -2.032145 1.019576 -9.658715 -6.210495 3.156823 7.457850 -5.313357 8.508296 3.959194 -5.246654 ... -2.209663 -10.340123 -7.697555 -5.932752 10.872688 0.081321 1.276316 5.281225 -0.516447 0.0
1 8.306217 6.649376 -0.960333 -4.094799 8.738965 -3.458797 7.016800 6.692765 0.898264 9.337643 ... 0.851793 -9.678324 -6.071795 1.428194 -8.082792 -0.557089 -7.817282 -8.686722 -6.953100 1.0
2 6.570842 6.985462 -1.842621 -1.569599 10.039339 -3.623026 8.957619 7.577283 1.541255 7.161509 ... 1.376085 -8.971164 -5.302191 2.898965 -8.746597 -0.520888 -7.350999 -8.925501 -7.051179 1.0
3 -1.139972 0.579422 -9.526530 -5.744928 4.834355 5.907235 -4.804137 6.798810 5.403670 -7.642857 ... 0.270571 -8.640988 -8.105419 -5.079015 9.351282 0.641759 1.898083 3.904671 1.453499 0.0
4 -1.738104 0.234729 -11.558768 -7.181332 4.189626 7.765274 -2.189083 7.239925 3.135602 -6.211390 ... -0.013973 -9.437110 -6.475267 -5.708377 9.623080 1.802899 1.903705 4.188442 1.522362 0.0

5 rows × 31 columns


In [4]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 31 columns):
EJWY     500 non-null float64
VALM     500 non-null float64
EGXO     500 non-null float64
HTGR     500 non-null float64
SKRF     500 non-null float64
NNSZ     500 non-null float64
NYLC     500 non-null float64
GWID     500 non-null float64
TVUT     500 non-null float64
CJHI     500 non-null float64
NVFW     500 non-null float64
VLBG     500 non-null float64
IDIX     500 non-null float64
UVHN     500 non-null float64
IWOT     500 non-null float64
LEMB     500 non-null float64
QMYY     500 non-null float64
XDGR     500 non-null float64
ODZS     500 non-null float64
LNJS     500 non-null float64
WDRT     500 non-null float64
LKKS     500 non-null float64
UOBF     500 non-null float64
VBHE     500 non-null float64
FRWU     500 non-null float64
NDYZ     500 non-null float64
QSBO     500 non-null float64
JDUB     500 non-null float64
TEVK     500 non-null float64
EZTM     500 non-null float64
Label    500 non-null float64
dtypes: float64(31)
memory usage: 121.2 KB

In [5]:
data.describe()


Out[5]:
EJWY VALM EGXO HTGR SKRF NNSZ NYLC GWID TVUT CJHI ... LKKS UOBF VBHE FRWU NDYZ QSBO JDUB TEVK EZTM Label
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 ... 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000
mean 4.237752 3.755108 -5.614445 -4.747200 6.447995 1.776850 1.718450 7.208016 2.556548 1.222064 ... 0.295252 -9.053808 -6.291877 -2.345864 1.125596 0.284048 -2.817147 -2.192278 -2.816977 0.500000
std 4.121210 2.540833 3.853295 2.164355 2.796104 5.030617 5.771508 1.167246 2.146874 7.410762 ... 1.017020 1.008391 1.305176 3.973564 8.839871 1.045746 4.548817 6.960762 3.758615 0.500501
min -2.032145 -1.677119 -12.167510 -9.507402 1.220239 -5.435379 -6.699806 4.074939 -2.830792 -8.851496 ... -3.046497 -12.128499 -9.582822 -9.367262 -10.986387 -2.595682 -9.710075 -11.325978 -9.363069 0.000000
25% 0.287295 1.450981 -9.258086 -6.608699 3.816363 -3.246286 -3.921556 6.457160 0.742799 -5.980770 ... -0.346735 -9.698782 -7.330375 -6.232200 -7.569584 -0.466278 -7.291228 -9.077094 -6.421727 0.000000
50% 4.212893 4.122470 -4.681202 -4.521427 6.009192 1.465326 2.119661 7.148805 2.399665 1.082333 ... 0.258733 -9.066828 -6.262909 -2.188896 1.200635 0.229365 -2.450744 -1.828291 -2.160272 0.500000
75% 8.238277 6.066863 -1.901586 -2.879066 9.145269 6.819129 7.323175 7.974873 4.526339 8.480955 ... 1.028362 -8.344404 -5.314031 1.427888 9.875877 0.983905 1.569697 4.648586 0.744805 1.000000
max 11.221614 8.464551 0.806140 -0.109049 12.327433 9.730383 9.918112 10.449979 7.032117 11.569669 ... 3.600537 -4.976943 -2.583479 4.686482 12.750833 3.770563 4.717894 7.294646 3.375074 1.000000

8 rows × 31 columns

Scale the Data

Use scikit-learn to scale the data with a MinMaxScaler. Remember not to scale the Label column, just the features. Save this scaled data as a new variable; the solution below calls it X_data.


In [6]:
from sklearn.preprocessing import MinMaxScaler

In [7]:
scaler = MinMaxScaler()

In [8]:
X_data = scaler.fit_transform(data.drop('Label', axis = 1))

In [9]:
pd.DataFrame(X_data, columns = data.columns[:-1]).describe()


Out[9]:
EJWY VALM EGXO HTGR SKRF NNSZ NYLC GWID TVUT CJHI ... WDRT LKKS UOBF VBHE FRWU NDYZ QSBO JDUB TEVK EZTM
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 ... 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.473066 0.535634 0.505106 0.506493 0.470664 0.475560 0.506577 0.491460 0.546222 0.493290 ... 0.566938 0.502743 0.429933 0.470179 0.499611 0.510253 0.452344 0.477748 0.490515 0.513897
std 0.310946 0.250534 0.297009 0.230291 0.251738 0.331709 0.347306 0.183096 0.217671 0.362896 ... 0.154181 0.153004 0.141003 0.186471 0.282741 0.372406 0.164264 0.315278 0.373820 0.295068
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.175002 0.308440 0.224256 0.308427 0.233734 0.144344 0.167184 0.373679 0.362326 0.140576 ... 0.471542 0.406161 0.339747 0.321808 0.223077 0.143943 0.334484 0.167650 0.120774 0.230908
50% 0.471190 0.571857 0.577039 0.530516 0.431158 0.455019 0.530720 0.482172 0.530316 0.486448 ... 0.562805 0.497249 0.428113 0.474318 0.510780 0.513414 0.443754 0.503143 0.510063 0.565451
75% 0.774906 0.763581 0.791290 0.705266 0.713504 0.808038 0.843847 0.611751 0.745939 0.848749 ... 0.657490 0.613034 0.529129 0.609885 0.768133 0.878884 0.562276 0.781799 0.857896 0.793512
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 30 columns

The Linear Autoencoder

Import tensorflow and import the fully_connected layer function from tensorflow.contrib.layers.


In [10]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected


WARNING:tensorflow:From c:\programdata\anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

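Note: tensorflow.contrib was removed in TensorFlow 2.x, so the import above only works on TensorFlow 1.x. If you are on a newer version, a minimal sketch of an equivalent linear layer (assuming the tf.compat.v1 session/placeholder API is used for the rest of this notebook) might look like this:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# A dense layer with activation=None plays the role of contrib's linear
# fully_connected layer, so the rest of the exercise can stay unchanged.
def fully_connected(inputs, num_outputs, activation_fn=None):
    return tf.layers.dense(inputs, units=num_outputs, activation=activation_fn)
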
Fill in the number of inputs to match the dimensionality of the data set and set the number of hidden units to 2. Set the number of outputs equal to the number of inputs, and choose a learning_rate value.


In [11]:
num_inputs = 30 # FILL ME IN
num_hidden = 2 # FILL ME IN 
num_outputs = num_inputs # Must be true for an autoencoder!

learning_rate = 0.01 #FILL ME IN

Placeholder

Create a placeholder for the data called X.


In [12]:
X = tf.placeholder(tf.float32, shape = [None, num_inputs])

Layers

Create the hidden layer and the output layer using the fully_connected function. Remember that to perform PCA there is no activation function: with purely linear layers and an MSE loss, the 2-unit hidden layer learns the same subspace that PCA would find.


In [13]:
hidden_layer = fully_connected(inputs = X, 
                               num_outputs = num_hidden, 
                               activation_fn = None)
outputs = fully_connected(inputs = hidden_layer, 
                         num_outputs = num_outputs, 
                         activation_fn = None)

Loss Function

Create a Mean Squared Error loss function.


In [14]:
loss = tf.reduce_mean(tf.square(outputs - X))

Optimizer

Create an AdamOptimizer designed to minimize the previous loss function.


In [15]:
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)

Init

Create an instance of a global variables initializer.


In [16]:
init = tf.global_variables_initializer()

Running the Session

Now create a TensorFlow session that runs the optimizer for at least 1000 steps. (You can also use epochs if you prefer, where 1 epoch is defined as one full pass through the entire dataset.)


In [17]:
num_steps = 1000

with tf.Session() as sess:
    sess.run(init)
    for iteration in range(num_steps):
        sess.run(train,
                 feed_dict = {X: X_data})

    # Now ask for the hidden layer output (the 2 dimensional output)
    output_2d = hidden_layer.eval(feed_dict = {X: X_data})

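If you would rather think in epochs with mini-batches, a minimal sketch of the same training loop (the batch_size here is an arbitrary choice, not part of the exercise) could look like this:

num_epochs = 1000
batch_size = 100

with tf.Session() as sess:
    sess.run(init)
    n = X_data.shape[0]
    for epoch in range(num_epochs):
        # Shuffle once per epoch, then feed one mini-batch at a time
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = X_data[idx[start:start + batch_size]]
            sess.run(train, feed_dict={X: batch})
    output_2d = hidden_layer.eval(feed_dict={X: X_data})
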
Confirm that your output is now 2-dimensional along the axis that previously held the 30 features.


In [18]:
output_2d.shape


Out[18]:
(500, 2)

Now plot the reduced-dimensional representation of the data. Do you still have clear separation of classes even with the reduction in dimensions? Hint: You definitely should; the classes should still be clearly separable, even when reduced to 2 dimensions.


In [19]:
plt.scatter(output_2d[:, 0],
            output_2d[:, 1],
            c = data['Label'])


Out[19]:
<matplotlib.collections.PathCollection at 0x21972b5def0>
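As an optional sanity check, you could compare the autoencoder's 2-D projection against scikit-learn's PCA on the same scaled data; a minimal sketch (assuming sklearn.decomposition is available) is:

from sklearn.decomposition import PCA

# Project the scaled features onto the first two principal components and
# plot them the same way; the class separation should look similar, though
# the axes may be rotated or reflected relative to the autoencoder's output.
pca_2d = PCA(n_components=2).fit_transform(X_data)
plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=data['Label'])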

Great Job!