Practice of the preprocessing utilities from the official scikit-learn doc [http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing]

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.


In [1]:
from sklearn import preprocessing
import numpy as np
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = preprocessing.scale(X)

In [2]:
X


Out[2]:
array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [3]:
X_scaled


Out[3]:
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [4]:
X.mean(axis=0)


Out[4]:
array([ 1.        ,  0.        ,  0.33333333])

In [9]:
X_scaled.mean(axis=0)


Out[9]:
array([ 0.,  0.,  0.])
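The cells above only check the mean; the companion check is that each scaled column also has unit standard deviation. A quick sketch of that check:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = preprocessing.scale(X)

# scale() removes the mean and divides by the (population) standard
# deviation, so every column should come out with std == 1
print(X_scaled.std(axis=0))  # -> [1. 1. 1.]
```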

In [13]:
scaler = preprocessing.StandardScaler().fit(X)

In [15]:
print(scaler.mean_)
print(scaler.scale_)


[ 1.          0.          0.33333333]
[ 0.81649658  0.81649658  1.24721913]
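The advantage of the StandardScaler class over the scale() function is that the fitted statistics can be reused on later data. A sketch, where X_test is a made-up new sample:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X)

# transform() reuses the training mean_ and scale_, so new samples are
# standardized consistently with the training data
X_test = np.array([[-1., 1., 0.]])  # hypothetical new sample
print(scaler.transform(X_test))
# equivalent by hand: (X_test - scaler.mean_) / scaler.scale_
```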

In [25]:
# try robust_scale or the RobustScaler class
X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])
X_1 = preprocessing.robust_scale(X)
X_2 = preprocessing.scale(X)

In [23]:
X_1


Out[23]:
array([[ 0.5  , -4.   ,  6.   ],
       [-0.5  , -2.   ,  2.   ],
       [ 0.5  ,  0.   ,  0.   ],
       [-1.995,  0.   ,  0.   ],
       [-0.5  ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ],
       [ 1.5  ,  0.   ,  0.   ]])

In [21]:
X_1.mean(axis=0)


Out[21]:
array([-0.07071429, -0.85714286,  1.14285714])

In [24]:
X_2.mean(axis=0)


Out[24]:
array([  2.22044605e-16,   1.26882631e-16,  -3.17206578e-17])
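robust_scale centers each column on its median and scales by its interquartile range, which is why the 0.01 outlier in the first column barely distorts the other rows, while scale (which uses the mean) is pulled toward it. The class form exposes those robust statistics; a sketch:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])

scaler = preprocessing.RobustScaler().fit(X)
print(scaler.center_)  # per-column medians
print(scaler.scale_)   # per-column interquartile ranges

# after transforming, the *median* of each column is 0 (not the mean)
X_1 = scaler.transform(X)
print(np.median(X_1, axis=0))  # -> [0. 0. 0.]
```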

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.


In [28]:
X_norm = preprocessing.normalize(X, norm='l2')

In [29]:
X_norm


Out[29]:
array([[ 0.91287093, -0.18257419,  0.36514837],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.96225045,  0.19245009, -0.19245009],
       [ 0.00707089,  0.7070891 , -0.7070891 ],
       [ 0.90453403,  0.30151134, -0.30151134],
       [ 0.94280904,  0.23570226, -0.23570226],
       [ 0.98019606,  0.14002801, -0.14002801]])
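A quick check that normalization worked: every row of X_norm should now have unit L2 (Euclidean) norm. A sketch:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])
X_norm = preprocessing.normalize(X, norm='l2')

# every sample (row) is scaled to unit Euclidean length,
# so dot products between rows are cosine similarities
print(np.linalg.norm(X_norm, axis=1))  # -> [1. 1. 1. 1. 1. 1. 1.]
```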

Encoding categorical features


In [ ]:
enc = preprocessing.OneHotEncoder()  # encoder for categorical features
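The cell above was left unfinished. Following the doc section this notebook tracks, a minimal OneHotEncoder sketch; the sample data here is made up, and string categories require scikit-learn >= 0.20:

```python
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
X_cat = [['male', 'US'],
         ['female', 'EU'],
         ['female', 'US']]
enc.fit(X_cat)

# categories are learned per column, sorted alphabetically
print(enc.categories_)

# each categorical column expands to one binary column per category;
# the result is sparse by default, so toarray() for display
print(enc.transform([['female', 'US']]).toarray())
# -> [[1. 0. 0. 1.]]
```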