Linear Autoencoder for PCA - EXERCISE

Follow the bold instructions below to reduce a 30-dimensional data set for classification down to a 2-dimensional dataset. Then color the points by class to see whether the same level of class separation is preserved after the dimensionality reduction.

The Data

Import numpy, matplotlib, and pandas


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Use pandas to read in the csv file called anonymized_data.csv. It contains 500 rows of anonymized data: 30 feature columns (renamed to 4-letter codes) plus one final column with a classification label.


In [2]:
data = pd.read_csv('./data/anonymized_data.csv')

In [3]:
data.head()


Out[3]:
EJWY VALM EGXO HTGR SKRF NNSZ NYLC GWID TVUT CJHI ... LKKS UOBF VBHE FRWU NDYZ QSBO JDUB TEVK EZTM Label
0 -2.032145 1.019576 -9.658715 -6.210495 3.156823 7.457850 -5.313357 8.508296 3.959194 -5.246654 ... -2.209663 -10.340123 -7.697555 -5.932752 10.872688 0.081321 1.276316 5.281225 -0.516447 0.0
1 8.306217 6.649376 -0.960333 -4.094799 8.738965 -3.458797 7.016800 6.692765 0.898264 9.337643 ... 0.851793 -9.678324 -6.071795 1.428194 -8.082792 -0.557089 -7.817282 -8.686722 -6.953100 1.0
2 6.570842 6.985462 -1.842621 -1.569599 10.039339 -3.623026 8.957619 7.577283 1.541255 7.161509 ... 1.376085 -8.971164 -5.302191 2.898965 -8.746597 -0.520888 -7.350999 -8.925501 -7.051179 1.0
3 -1.139972 0.579422 -9.526530 -5.744928 4.834355 5.907235 -4.804137 6.798810 5.403670 -7.642857 ... 0.270571 -8.640988 -8.105419 -5.079015 9.351282 0.641759 1.898083 3.904671 1.453499 0.0
4 -1.738104 0.234729 -11.558768 -7.181332 4.189626 7.765274 -2.189083 7.239925 3.135602 -6.211390 ... -0.013973 -9.437110 -6.475267 -5.708377 9.623080 1.802899 1.903705 4.188442 1.522362 0.0

5 rows × 31 columns


In [4]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 31 columns):
EJWY     500 non-null float64
VALM     500 non-null float64
EGXO     500 non-null float64
HTGR     500 non-null float64
SKRF     500 non-null float64
NNSZ     500 non-null float64
NYLC     500 non-null float64
GWID     500 non-null float64
TVUT     500 non-null float64
CJHI     500 non-null float64
NVFW     500 non-null float64
VLBG     500 non-null float64
IDIX     500 non-null float64
UVHN     500 non-null float64
IWOT     500 non-null float64
LEMB     500 non-null float64
QMYY     500 non-null float64
XDGR     500 non-null float64
ODZS     500 non-null float64
LNJS     500 non-null float64
WDRT     500 non-null float64
LKKS     500 non-null float64
UOBF     500 non-null float64
VBHE     500 non-null float64
FRWU     500 non-null float64
NDYZ     500 non-null float64
QSBO     500 non-null float64
JDUB     500 non-null float64
TEVK     500 non-null float64
EZTM     500 non-null float64
Label    500 non-null float64
dtypes: float64(31)
memory usage: 121.2 KB

In [5]:
data.describe()


Out[5]:
EJWY VALM EGXO HTGR SKRF NNSZ NYLC GWID TVUT CJHI ... LKKS UOBF VBHE FRWU NDYZ QSBO JDUB TEVK EZTM Label
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 ... 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000
mean 4.237752 3.755108 -5.614445 -4.747200 6.447995 1.776850 1.718450 7.208016 2.556548 1.222064 ... 0.295252 -9.053808 -6.291877 -2.345864 1.125596 0.284048 -2.817147 -2.192278 -2.816977 0.500000
std 4.121210 2.540833 3.853295 2.164355 2.796104 5.030617 5.771508 1.167246 2.146874 7.410762 ... 1.017020 1.008391 1.305176 3.973564 8.839871 1.045746 4.548817 6.960762 3.758615 0.500501
min -2.032145 -1.677119 -12.167510 -9.507402 1.220239 -5.435379 -6.699806 4.074939 -2.830792 -8.851496 ... -3.046497 -12.128499 -9.582822 -9.367262 -10.986387 -2.595682 -9.710075 -11.325978 -9.363069 0.000000
25% 0.287295 1.450981 -9.258086 -6.608699 3.816363 -3.246286 -3.921556 6.457160 0.742799 -5.980770 ... -0.346735 -9.698782 -7.330375 -6.232200 -7.569584 -0.466278 -7.291228 -9.077094 -6.421727 0.000000
50% 4.212893 4.122470 -4.681202 -4.521427 6.009192 1.465326 2.119661 7.148805 2.399665 1.082333 ... 0.258733 -9.066828 -6.262909 -2.188896 1.200635 0.229365 -2.450744 -1.828291 -2.160272 0.500000
75% 8.238277 6.066863 -1.901586 -2.879066 9.145269 6.819129 7.323175 7.974873 4.526339 8.480955 ... 1.028362 -8.344404 -5.314031 1.427888 9.875877 0.983905 1.569697 4.648586 0.744805 1.000000
max 11.221614 8.464551 0.806140 -0.109049 12.327433 9.730383 9.918112 10.449979 7.032117 11.569669 ... 3.600537 -4.976943 -2.583479 4.686482 12.750833 3.770563 4.717894 7.294646 3.375074 1.000000

8 rows × 31 columns

Scale the Data

Use scikit-learn to scale the data with a MinMaxScaler. Remember not to scale the Label column, just the features. Save this scaled data as a new variable; the solution below calls it X_data.


In [6]:
from sklearn.preprocessing import MinMaxScaler

In [7]:
scaler = MinMaxScaler()

In [8]:
X_data = scaler.fit_transform(data.drop('Label', axis = 1))

In [9]:
pd.DataFrame(X_data, columns = data.columns[:-1]).describe()


Out[9]:
EJWY VALM EGXO HTGR SKRF NNSZ NYLC GWID TVUT CJHI ... WDRT LKKS UOBF VBHE FRWU NDYZ QSBO JDUB TEVK EZTM
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 ... 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.473066 0.535634 0.505106 0.506493 0.470664 0.475560 0.506577 0.491460 0.546222 0.493290 ... 0.566938 0.502743 0.429933 0.470179 0.499611 0.510253 0.452344 0.477748 0.490515 0.513897
std 0.310946 0.250534 0.297009 0.230291 0.251738 0.331709 0.347306 0.183096 0.217671 0.362896 ... 0.154181 0.153004 0.141003 0.186471 0.282741 0.372406 0.164264 0.315278 0.373820 0.295068
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.175002 0.308440 0.224256 0.308427 0.233734 0.144344 0.167184 0.373679 0.362326 0.140576 ... 0.471542 0.406161 0.339747 0.321808 0.223077 0.143943 0.334484 0.167650 0.120774 0.230908
50% 0.471190 0.571857 0.577039 0.530516 0.431158 0.455019 0.530720 0.482172 0.530316 0.486448 ... 0.562805 0.497249 0.428113 0.474318 0.510780 0.513414 0.443754 0.503143 0.510063 0.565451
75% 0.774906 0.763581 0.791290 0.705266 0.713504 0.808038 0.843847 0.611751 0.745939 0.848749 ... 0.657490 0.613034 0.529129 0.609885 0.768133 0.878884 0.562276 0.781799 0.857896 0.793512
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 30 columns

The Linear Autoencoder

Import tensorflow and import the fully_connected layer function from tensorflow.contrib.layers.


In [10]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected


WARNING:tensorflow:From c:\programdata\anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

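Note: tensorflow.contrib was removed in TensorFlow 2.x, so the import above only works on TensorFlow 1.x. If you are on a newer version, a minimal sketch of an equivalent linear layer (assuming the tf.compat.v1 session/placeholder API is used for the rest of this notebook) might look like this:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# A dense layer with activation=None plays the role of contrib's linear
# fully_connected layer, so the rest of the exercise can stay unchanged.
def fully_connected(inputs, num_outputs, activation_fn=None):
    return tf.layers.dense(inputs, units=num_outputs, activation=activation_fn)
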
Fill in the number of inputs to match the dimensionality of the data set and set the number of hidden units to 2. Set the number of outputs equal to the number of inputs, and choose a learning_rate value.


In [11]:
num_inputs = 30 # FILL ME IN
num_hidden = 2 # FILL ME IN 
num_outputs = num_inputs # Must be true for an autoencoder!

learning_rate = 0.01 #FILL ME IN

Placeholder

Create a placeholder for the data called X.


In [12]:
X = tf.placeholder(tf.float32, shape = [None, num_inputs])

Layers

Create the hidden layer and the output layer using the fully_connected function. Remember that to perform PCA there is no activation function: with purely linear layers and an MSE loss, the 2-unit hidden layer learns the same subspace that PCA would find.


In [13]:
hidden_layer = fully_connected(inputs = X, 
                               num_outputs = num_hidden, 
                               activation_fn = None)
outputs = fully_connected(inputs = hidden_layer, 
                         num_outputs = num_outputs, 
                         activation_fn = None)

Loss Function

Create a Mean Squared Error loss function.


In [14]:
loss = tf.reduce_mean(tf.square(outputs - X))

Optimizer

Create an AdamOptimizer designed to minimize the previous loss function.


In [15]:
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)

Init

Create an instance of a global variables initializer.


In [16]:
init = tf.global_variables_initializer()

Running the Session

Now create a TensorFlow session that runs the optimizer for at least 1000 steps. (You can also use epochs if you prefer, where 1 epoch is defined as one full pass through the entire dataset.)


In [17]:
num_steps = 1000

with tf.Session() as sess:
    sess.run(init)
    for iteration in range(num_steps):
        sess.run(train,
                 feed_dict = {X: X_data})

    # Now ask for the hidden layer output (the 2 dimensional output)
    output_2d = hidden_layer.eval(feed_dict = {X: X_data})

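If you would rather think in epochs with mini-batches, a minimal sketch of the same training loop (the batch_size here is an arbitrary choice, not part of the exercise) could look like this:

num_epochs = 1000
batch_size = 100

with tf.Session() as sess:
    sess.run(init)
    n = X_data.shape[0]
    for epoch in range(num_epochs):
        # Shuffle once per epoch, then feed one mini-batch at a time
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = X_data[idx[start:start + batch_size]]
            sess.run(train, feed_dict={X: batch})
    output_2d = hidden_layer.eval(feed_dict={X: X_data})
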
Confirm that your output is now 2-dimensional along the axis that previously held the 30 features.


In [18]:
output_2d.shape


Out[18]:
(500, 2)

Now plot the reduced-dimensional representation of the data. Do you still have clear separation of classes even with the reduction in dimensions? Hint: You definitely should; the classes should still be clearly separable, even when reduced to 2 dimensions.


In [19]:
plt.scatter(output_2d[:, 0],
            output_2d[:, 1],
            c = data['Label'])


Out[19]:
<matplotlib.collections.PathCollection at 0x21972b5def0>
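As an optional sanity check, you could compare the autoencoder's 2-D projection against scikit-learn's PCA on the same scaled data; a minimal sketch (assuming sklearn.decomposition is available) is:

from sklearn.decomposition import PCA

# Project the scaled features onto the first two principal components and
# plot them the same way; the class separation should look similar, though
# the axes may be rotated or reflected relative to the autoencoder's output.
pca_2d = PCA(n_components=2).fit_transform(X_data)
plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=data['Label'])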

Great Job!