Practice of the preprocessing utilities from the official scikit-learn doc [http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing]

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.


In [1]:
from sklearn import preprocessing
import numpy as np
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = preprocessing.scale(X)

In [2]:
X


Out[2]:
array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [3]:
X_scaled


Out[3]:
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [4]:
X.mean(axis=0)


Out[4]:
array([ 1.        ,  0.        ,  0.33333333])

In [9]:
X_scaled.mean(axis=0)


Out[9]:
array([ 0.,  0.,  0.])
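The cells above only check the mean; the companion check is that each scaled column also has unit standard deviation. A quick sketch of that check:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = preprocessing.scale(X)

# scale() removes the mean and divides by the (population) standard
# deviation, so every column should come out with std == 1
print(X_scaled.std(axis=0))  # -> [1. 1. 1.]
```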

In [13]:
scaler = preprocessing.StandardScaler().fit(X)

In [15]:
print(scaler.mean_)
print(scaler.scale_)


[ 1.          0.          0.33333333]
[ 0.81649658  0.81649658  1.24721913]
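The advantage of the StandardScaler class over the scale() function is that the fitted statistics can be reused on later data. A sketch, where X_test is a made-up new sample:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X)

# transform() reuses the training mean_ and scale_, so new samples are
# standardized consistently with the training data
X_test = np.array([[-1., 1., 0.]])  # hypothetical new sample
print(scaler.transform(X_test))
# equivalent by hand: (X_test - scaler.mean_) / scaler.scale_
```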

In [25]:
# try robust_scale or the RobustScaler class
X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])
X_1 = preprocessing.robust_scale(X)
X_2 = preprocessing.scale(X)

In [23]:
X_1


Out[23]:
array([[ 0.5  , -4.   ,  6.   ],
       [-0.5  , -2.   ,  2.   ],
       [ 0.5  ,  0.   ,  0.   ],
       [-1.995,  0.   ,  0.   ],
       [-0.5  ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ],
       [ 1.5  ,  0.   ,  0.   ]])

In [21]:
X_1.mean(axis=0)


Out[21]:
array([-0.07071429, -0.85714286,  1.14285714])

In [24]:
X_2.mean(axis=0)


Out[24]:
array([  2.22044605e-16,   1.26882631e-16,  -3.17206578e-17])
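robust_scale centers each column on its median and scales by its interquartile range, which is why the 0.01 outlier in the first column barely distorts the other rows, while scale (which uses the mean) is pulled toward it. The class form exposes those robust statistics; a sketch:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])

scaler = preprocessing.RobustScaler().fit(X)
print(scaler.center_)  # per-column medians
print(scaler.scale_)   # per-column interquartile ranges

# after transforming, the *median* of each column is 0 (not the mean)
X_1 = scaler.transform(X)
print(np.median(X_1, axis=0))  # -> [0. 0. 0.]
```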

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.


In [28]:
X_norm = preprocessing.normalize(X, norm='l2')

In [29]:
X_norm


Out[29]:
array([[ 0.91287093, -0.18257419,  0.36514837],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.96225045,  0.19245009, -0.19245009],
       [ 0.00707089,  0.7070891 , -0.7070891 ],
       [ 0.90453403,  0.30151134, -0.30151134],
       [ 0.94280904,  0.23570226, -0.23570226],
       [ 0.98019606,  0.14002801, -0.14002801]])
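A quick check that normalization worked: every row of X_norm should now have unit L2 (Euclidean) norm. A sketch:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])
X_norm = preprocessing.normalize(X, norm='l2')

# every sample (row) is scaled to unit Euclidean length,
# so dot products between rows are cosine similarities
print(np.linalg.norm(X_norm, axis=1))  # -> [1. 1. 1. 1. 1. 1. 1.]
```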

Encoding categorical features


In [ ]:
enc = preprocessing.OneHotEncoder()  # encoder for categorical features
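The cell above was left unfinished. Following the doc section this notebook tracks, a minimal OneHotEncoder sketch; the sample data here is made up, and string categories require scikit-learn >= 0.20:

```python
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
X_cat = [['male', 'US'],
         ['female', 'EU'],
         ['female', 'US']]
enc.fit(X_cat)

# categories are learned per column, sorted alphabetically
print(enc.categories_)

# each categorical column expands to one binary column per category;
# the result is sparse by default, so toarray() for display
print(enc.transform([['female', 'US']]).toarray())
# -> [[1. 0. 0. 1.]]
```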