Practice preprocessing following the scikit-learn official documentation [http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing]
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they may behave badly if the individual features do not look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.
In [1]:
from sklearn import preprocessing
import numpy as np
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
X_scaled = preprocessing.scale(X)
In [2]:
X
Out[2]:
In [3]:
X_scaled
Out[3]:
In [4]:
X.mean(axis=0)
Out[4]:
In [9]:
X_scaled.mean(axis=0)
Out[9]:
In [13]:
scaler = preprocessing.StandardScaler().fit(X)
In [15]:
print(scaler.mean_)
print(scaler.scale_)
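A minimal sketch of why `StandardScaler` is preferable to the `scale()` function in a real pipeline: the fitted mean and scale can be reused to transform unseen data consistently (the `X_new` values here are made up for illustration).

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Fit once on the training data; the column means and scales are stored.
scaler = preprocessing.StandardScaler().fit(X_train)

# Transformed training data has zero mean and unit variance per column.
X_train_scaled = scaler.transform(X_train)

# The same statistics are applied to new data (illustrative values).
X_new = np.array([[-1., 1., 0.]])
X_new_scaled = scaler.transform(X_new)
```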
In [25]:
# Try robust_scale or RobustScaler, which use the median and IQR
# instead of the mean and standard deviation.
X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.],
              [0.01, 1., -1.],
              [3., 1., -1.],
              [4., 1., -1.],
              [7., 1., -1.]])
X_1 = preprocessing.robust_scale(X)
X_2 = preprocessing.scale(X)
In [23]:
X_1
Out[23]:
In [21]:
X_1.mean(axis=0)
Out[21]:
In [24]:
X_2.mean(axis=0)
Out[24]:
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
In [28]:
X_norm = preprocessing.normalize(X, norm='l2')
In [29]:
X_norm
Out[29]:
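A quick check of the property described above: after l2 normalization every row (sample) has Euclidean norm 1, which is what makes row dot products behave like cosine similarities.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[5., -1., 2.],
              [3., 0., 0.],
              [5., 1., -1.]])
X_norm = preprocessing.normalize(X, norm='l2')

# Each sample (row) now has unit Euclidean norm.
row_norms = np.linalg.norm(X_norm, axis=1)
print(row_norms)  # [1. 1. 1.]
```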
In [ ]:
enc =
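The last cell is left unfinished. The corresponding section of the scikit-learn preprocessing guide continues with categorical feature encoding, so `enc` was presumably meant to hold an encoder; a minimal sketch assuming `OneHotEncoder` was intended (the integer input data here is made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumption: the unfinished cell was heading toward one-hot encoding,
# as in the scikit-learn preprocessing guide. Two categorical columns:
# the first has 2 categories, the second has 3.
enc = OneHotEncoder()
X_cat = np.array([[0, 1],
                  [1, 2],
                  [0, 0]])
enc.fit(X_cat)

# transform() returns a sparse matrix; densify for inspection.
onehot = enc.transform(X_cat).toarray()
print(onehot.shape)  # (3, 5): 2 + 3 one-hot columns
```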