Standardization, or mean removal and variance scaling


scale function


In [1]:
from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)

X_scaled


Out[1]:
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [3]:
X_scaled.mean(axis=0), X_scaled.std(axis=0)


Out[3]:
(array([ 0.,  0.,  0.]), array([ 1.,  1.,  1.]))
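
The transformation is simply z = (x - mean) / std, computed per column. As a sanity check (this cell is illustrative, not part of the original run):

In [ ]:
# Recompute the standardization by hand: subtract each column's mean
# and divide by its (population) standard deviation.
Z = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
np.allclose(Z, X_scaled)   # should print True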

StandardScaler


In [12]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler


Out[12]:
StandardScaler(copy=True, with_mean=True, with_std=True)

In [13]:
scaler.mean_, scaler.scale_, scaler.transform(X_train)


Out[13]:
(array([ 1.        ,  0.        ,  0.33333333]),
 array([ 0.81649658,  0.81649658,  1.24721913]),
 array([[ 0.        , -1.22474487,  1.33630621],
        [ 1.22474487,  0.        , -0.26726124],
        [-1.22474487,  1.22474487, -1.06904497]]))

Once a scaler instance has been fitted, it can perform the same transformation on new data:


In [5]:
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)


Out[5]:
array([[-2.44948974,  1.22474487, -0.26726124]])
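
A fitted scaler can also undo its transformation via inverse_transform; a short illustrative check (not part of the original run):

In [ ]:
# Map standardized values back to the original feature units.
np.allclose(scaler.inverse_transform(X_scaled), X_train)   # should print True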

Scaling features to a range


MinMaxScaler


In [6]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax


Out[6]:
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [7]:
X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax


Out[7]:
array([[-1.5       ,  0.        ,  1.66666667]])

In [8]:
min_max_scaler.scale_, min_max_scaler.min_


Out[8]:
(array([ 0.5       ,  0.5       ,  0.33333333]),
 array([ 0.        ,  0.5       ,  0.33333333]))
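
Under the hood the transform is the affine map X * scale_ + min_; a quick verification (added for illustration):

In [ ]:
# MinMaxScaler applies a per-column affine map: X * scale_ + min_,
# where scale_ = 1 / (data_max_ - data_min_) for the default (0, 1) range.
np.allclose(X_train * min_max_scaler.scale_ + min_max_scaler.min_,
            X_train_minmax)   # should print True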

MaxAbsScaler


In [9]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs


Out[9]:
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [10]:
X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs


Out[10]:
array([[-1.5, -1. ,  2. ]])

In [11]:
max_abs_scaler.scale_


Out[11]:
array([ 2.,  1.,  2.])
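
Here scale_ is just the per-column maximum absolute value, so the transform is a plain division; a quick check (added for illustration):

In [ ]:
# MaxAbsScaler divides each column by its maximum absolute value,
# mapping training data into [-1, 1] while leaving zeros untouched.
np.allclose(X_train / np.abs(X_train).max(axis=0), X_train_maxabs)   # should print True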

minmax_scale function


In [14]:
preprocessing.minmax_scale(X_train)


Out[14]:
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

maxabs_scale function


In [15]:
preprocessing.maxabs_scale(X_train)


Out[15]:
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

Scaling sparse data

Centering sparse data would destroy its sparsity structure, so it is rarely a sensible thing to do. It can make sense to scale sparse inputs, however, especially when features are on different scales. MaxAbsScaler is designed for exactly this situation.
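
A minimal sketch of scaling a SciPy sparse matrix (the CSR input below is illustrative):

In [ ]:
from scipy import sparse

# MaxAbsScaler accepts CSR/CSC matrices and returns a sparse result,
# since dividing by the per-column max-abs never fills in zero entries.
X_sparse = sparse.csr_matrix(X_train)
X_sparse_scaled = preprocessing.MaxAbsScaler().fit_transform(X_sparse)
X_sparse_scaled.toarray()

scale and StandardScaler also accept sparse input, provided with_mean=False is passed so that no centering is attempted.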

Non-linear transformation


QuantileTransformer


In [18]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])


Out[18]:
array([ 4.3,  5.1,  5.8,  6.5,  7.9])

In [19]:
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)

X_train_trans = quantile_transformer.fit_transform(X_train)

X_test_trans = quantile_transformer.transform(X_test)

In [20]:
np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])


Out[20]:
array([  9.99999998e-08,   2.38738739e-01,   5.09009009e-01,
         7.43243243e-01,   9.99999900e-01])

In [21]:
(np.percentile(X_test[:, 0], [0, 25, 50, 75, 100]),
 np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100]))


Out[21]:
(array([ 4.4  ,  5.125,  5.75 ,  6.175,  7.3  ]),
 array([ 0.01351351,  0.25012513,  0.47972973,  0.6021021 ,  0.94144144]))

Passing output_distribution='normal' maps the data to a normal distribution instead of a uniform one:


In [22]:
quantile_transformer = preprocessing.QuantileTransformer(
    output_distribution='normal', random_state=0)

X_trans = quantile_transformer.fit_transform(X)

quantile_transformer.quantiles_


Out[22]:
array([[ 4.3       ,  2.        ,  1.        ,  0.1       ],
       [ 4.31491491,  2.02982983,  1.01491491,  0.1       ],
       [ 4.32982983,  2.05965966,  1.02982983,  0.1       ],
       ..., 
       [ 7.84034034,  4.34034034,  6.84034034,  2.5       ],
       [ 7.87017017,  4.37017017,  6.87017017,  2.5       ],
       [ 7.9       ,  4.4       ,  6.9       ,  2.5       ]])
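
With output_distribution='normal', the transformed quartiles should land near those of a standard normal; a quick comparison against scipy.stats.norm.ppf (this cell is illustrative):

In [ ]:
from scipy import stats

# The 25th/50th/75th percentiles of the transformed feature should be
# close to the standard-normal quantiles at those probabilities.
np.percentile(X_trans[:, 0], [25, 50, 75]), stats.norm.ppf([0.25, 0.5, 0.75])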
