In this lab session we will learn how to pre-process feature vectors using numpy. For this purpose, let's create 10 feature vectors, each with 5 features. We use numpy.random to generate these examples.


In [1]:
import numpy
X = numpy.random.randn(10, 5)

Let's print this matrix X, where each row is a feature vector.


In [2]:
print X


[[ 0.69443373  0.96195222  1.34055028 -0.06060727 -0.71991928]
 [ 1.78509174  0.65199536  0.16140449  0.10310347  1.90992752]
 [ 1.37616541  1.80505576 -0.95833519 -1.02030559  0.04881924]
 [ 1.10715636 -0.39356113 -0.37077065 -0.39378517  1.23563678]
 [-1.94500005 -0.01895106  1.64168871 -0.50474564 -0.77917618]
 [-0.30773122 -0.41844751 -1.66776205  0.71960226  0.63807079]
 [ 1.05687671 -0.48325336  0.29916813 -1.52100051  0.32204151]
 [ 0.75886575  0.51190147  1.04201629  0.37280639 -2.07993217]
 [ 0.11850333 -0.34930383  1.39946381  0.89181387  0.82877795]
 [ 0.1124358  -0.87777225  0.2452083  -0.85698553  1.31558673]]

We can access the i-th row of X by X[i,:]. Likewise, the j-th column can be accessed by X[:,j].


In [3]:
print X[1,:]


[ 1.78509174  0.65199536  0.16140449  0.10310347  1.90992752]

In [4]:
print X[:,1]


[ 0.96195222  0.65199536  1.80505576 -0.39356113 -0.01895106 -0.41844751
 -0.48325336  0.51190147 -0.34930383 -0.87777225]

Next, let's $\ell_1$ normalize each feature vector. For this purpose we must compute the sum of the absolute values in each feature vector and divide each element in the vector by this norm. The $\ell_1$ norm is defined as follows:

$\ell_1 (\mathbf{x}) = \sum_i |x_i|$

Let us compute the $\ell_1$ norm of each feature vector in X. We can use the abs function, which gives the absolute value of a number. Conveniently, this function also operates element-wise on an array. The sum function gives the sum, obviously!


In [6]:
for i in range(0, 10):
    print i, numpy.sum(numpy.abs(X[i,:]))


0 3.77746277978
1 4.61152257455
2 5.20868119231
3 3.50091009033
4 4.88956162894
5 3.75161383363
6 3.68234022551
7 4.76552206717
8 3.58786278895
9 3.40798861386
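
By the way, the loop above is not the only way to do this: numpy.abs and numpy.sum can operate on the whole matrix at once, and the axis=1 argument tells sum to add up along each row. A minimal sketch of this vectorized alternative (it should reproduce the ten values just printed):

In [ ]:
# l1 norm of every row at once: sum the absolute values along axis=1 (rows)
print numpy.sum(numpy.abs(X), axis=1)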

Now let's compute $\ell_2$ norms instead. For this we need to square the elements, add them up, and take the square root.
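
In the same notation as before, the $\ell_2$ norm is defined as:

$\ell_2 (\mathbf{x}) = \sqrt{\sum_i x_i^2}$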


In [7]:
for i in range(0, 10):
    print i, numpy.sqrt(numpy.sum(X[i,:] * X[i,:]))


0 1.93044615389
1 2.7011396334
2 2.66718403482
3 1.78886037946
4 2.70931907007
5 1.99403832221
6 1.9639697193
7 2.52761166383
8 1.8912324085
9 1.8189133254
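
As an aside (not required for this lab), numpy also ships a ready-made norm function, numpy.linalg.norm. By default it computes the $\ell_2$ norm of a vector, and passing 1 as the second (ord) argument gives the $\ell_1$ norm. A small sketch that should reproduce the numbers we computed above:

In [ ]:
# numpy.linalg.norm gives the l2 norm of a vector by default;
# passing 1 as the second argument gives the l1 norm instead
for i in range(0, 10):
    print i, numpy.linalg.norm(X[i,:]), numpy.linalg.norm(X[i,:], 1)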

If you wanted to $\ell_2$ normalize X, this can be done as follows.


In [8]:
for i in range(0,10):
    norm = numpy.sqrt(numpy.sum(X[i,:] * X[i,:]))
    X[i,:] = X[i,:] / norm

In [9]:
print X


[[ 0.35972706  0.49830565  0.69442511 -0.03139547 -0.37292896]
 [ 0.66086615  0.24137788  0.05975422  0.03817036  0.70708211]
 [ 0.51596193  0.67676461 -0.35930599 -0.38254038  0.01830367]
 [ 0.61891715 -0.22000662 -0.2072664  -0.22013186  0.69073964]
 [-0.71789258 -0.00699477  0.60594144 -0.18629981 -0.28759115]
 [-0.15432563 -0.20984928 -0.83637412  0.36087685  0.31998923]
 [ 0.53813289 -0.24605948  0.15232828 -0.77445212  0.16397479]
 [ 0.30023036  0.20252378  0.41225331  0.14749354 -0.82288439]
 [ 0.06265932 -0.18469641  0.73997453  0.47155171  0.4382211 ]
 [ 0.06181482 -0.48258058  0.13481033 -0.47115248  0.72328171]]

Just to make sure that X is indeed $\ell_2$ normalized, let's print the norms again.


In [10]:
for i in range(0,10):
    print i, numpy.sqrt(numpy.sum(X[i,:] * X[i,:]))


0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0

OK! That looks fine. Now try to $\ell_1$ normalize X as well by yourself.
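
Incidentally, the row-by-row $\ell_2$ normalization in In [8] can also be written without the loop by using broadcasting. This is just a sketch of the same computation: the [:, numpy.newaxis] reshapes the vector of row norms into a column so that each row of X gets divided by its own norm.

In [ ]:
# l2 normalize every row of X at once using broadcasting
norms = numpy.sqrt(numpy.sum(X * X, axis=1))
X = X / norms[:, numpy.newaxis]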

Let us assume that we further wish to scale each feature (dimension) to the [0,1] range using the (x - min) / (max - min) method (see the lecture notes for details). We need to find the min and max of each feature across all feature vectors, which amounts to computing the min and max of each column in X. Guess what, numpy has min and max functions that return the minimum and maximum values of an array. How convenient...


In [11]:
print X[:,0]
print numpy.min(X[:,0])
print numpy.max(X[:,0])


[ 0.35972706  0.66086615  0.51596193  0.61891715 -0.71789258 -0.15432563
  0.53813289  0.30023036  0.06265932  0.06181482]
-0.717892575778
0.660866145131

Let's use these functions to perform the [0,1] scaling on X.


In [12]:
for j in range(0, 5):
    minVal = numpy.min(X[:,j])
    maxVal = numpy.max(X[:,j])
    for i in range(0, 10):
        X[i,j] = (X[i,j] - minVal) / (maxVal - minVal)

In [13]:
print X


[[ 0.78158682  0.84606918  0.97110448  0.59635182  0.29101364]
 [ 1.          0.62445462  0.56848359  0.65218297  0.98952273]
 [ 0.89490241  1.          0.30264126  0.31453494  0.54404766]
 [ 0.96957481  0.22648471  0.39909174  0.44487845  0.97895306]
 [ 0.          0.41021934  0.91497244  0.4720309   0.34620682]
 [ 0.40874951  0.23524598  0.          0.91117615  0.73916614]
 [ 0.91098279  0.20401267  0.62721049  0.          0.63826207]
 [ 0.73843445  0.59094079  0.79210106  0.73992201  0.        ]
 [ 0.56612653  0.25694174  1.          1.          0.8156339 ]
 [ 0.56551403  0.          0.61609749  0.2434179   1.        ]]

OK! Everything is in [0,1] now. One thing to remember is that if the min and max are equal, the scaling involves a division by zero. If this is the case, it means all values of that feature are the same, so you can set the scaled value to either 0 or 1, whichever you prefer, as long as you are consistent. Of course, a feature that has the same value across all training instances is not useful because it does not discriminate between the classes, so you can even remove it from your training data and be happy about it (one less feature to worry about).
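
If you want to guard against that case in code, here is one possible sketch, keeping the loop style used above; it sets a constant feature to 0 (you could just as well drop the column):

In [ ]:
# [0,1] scaling with a guard against constant features (max == min)
for j in range(0, 5):
    minVal = numpy.min(X[:,j])
    maxVal = numpy.max(X[:,j])
    if maxVal == minVal:
        X[:,j] = 0.0    # constant feature: set it to 0 (or remove it)
    else:
        X[:,j] = (X[:,j] - minVal) / (maxVal - minVal)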

Let us assume that we wanted to do Gaussian scaling (see lecture notes) on this X. Here, we would use (x - mean) / sd, where sd is the standard deviation of the feature values. Not very surprisingly, numpy has numpy.mean and numpy.std functions that compute exactly these quantities. I guess at this point I have convinced you why you should use python+numpy for data mining and machine learning.


In [14]:
for j in range(0, 5):
    mean = numpy.mean(X[:,j])
    sd = numpy.std(X[:,j])
    for i in range(0, 10):
        X[i,j] = (X[i,j] - mean) / sd

In [15]:
print X


[[ 0.33389546  1.35057571  1.15970477  0.20214035 -1.06754809]
 [ 1.07805267  0.61451206 -0.16702429  0.39374116  1.10479045]
 [ 0.71997388  1.86183672 -1.04303635 -0.76499606 -0.28062124]
 [ 0.97439077 -0.70729275 -0.72520968 -0.31768446  1.07191914]
 [-2.32905491 -0.09704231  0.9747367  -0.22450297 -0.89589928]
 [-0.93640137 -0.67819333 -2.04030937  1.28255142  0.32619029]
 [ 0.77476147 -0.78193069  0.02649442 -1.84441401  0.01238222]
 [ 0.18687069  0.50320027  0.56984705  0.69484314 -1.97259021]
 [-0.40020091 -0.60613371  1.2549222   1.58737625  0.56400235]
 [-0.40228777 -1.45953197 -0.01012546 -1.00905482  1.13737437]]
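
As with the earlier scalings, the loops can be avoided entirely: numpy.mean and numpy.std accept an axis argument, and broadcasting takes care of the rest. A minimal sketch, assuming X holds the data you want to standardize:

In [ ]:
# Gaussian scaling of every column at once: subtract the per-column means
# and divide by the per-column standard deviations (broadcasting over rows)
X = (X - numpy.mean(X, axis=0)) / numpy.std(X, axis=0)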

In [ ]: