Using dstoolbox transformers

Imports


In [1]:
import re

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [3]:
from dstoolbox.transformers import ItemSelector
from dstoolbox.transformers import XLabelEncoder
from dstoolbox.transformers import ParallelFunctionTransformer
from dstoolbox.transformers import ToDataFrame
from dstoolbox.transformers import Padder2d
from dstoolbox.transformers import Padder3d
from dstoolbox.transformers import TextFeaturizer

Slicing

ItemSelector

Select a column or a slice along axis=1 from a numpy array.


In [4]:
X = np.eye(5)

In [5]:
X


Out[5]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

Using just the 2nd column


In [6]:
# pass a slice object
pipeline = Pipeline([
    ('selector', ItemSelector(slice(1, 2))),
    ('scaler', StandardScaler()),
])

In [7]:
pipeline.fit_transform(X)


Out[7]:
array([[-0.5],
       [ 2. ],
       [-0.5],
       [-0.5],
       [-0.5]])

In [8]:
# or a list
pipeline = Pipeline([
    ('selector', ItemSelector([1])),
    ('scaler', StandardScaler()),
])

In [9]:
pipeline.fit_transform(X)


Out[9]:
array([[-0.5],
       [ 2. ],
       [-0.5],
       [-0.5],
       [-0.5]])

Using the 2nd and 4th columns


In [10]:
pipeline = Pipeline([
    ('selector', ItemSelector([1, 3])),
    ('scaler', StandardScaler()),
])

In [11]:
pipeline.fit_transform(X)


Out[11]:
array([[-0.5, -0.5],
       [ 2. , -0.5],
       [-0.5, -0.5],
       [-0.5,  2. ],
       [-0.5, -0.5]])

Using a slice


In [12]:
pipeline = Pipeline([
    ('selector', ItemSelector(np.s_[2:6:2])),
    ('scaler', StandardScaler()),
])

In [13]:
pipeline.fit_transform(X)


Out[13]:
array([[-0.5, -0.5],
       [-0.5, -0.5],
       [ 2. , -0.5],
       [-0.5, -0.5],
       [-0.5,  2. ]])

Applying slicing to a pandas DataFrame


In [14]:
X = pd.DataFrame(data={
        'names': ['Alice', 'Bob', 'Charles', 'Dora', 'Eve'],
        'surnames': ['Carroll', 'Meister', 'Darwin', 'Explorer', 'Wally'],
        'age': [14., 30., 55., 7., 25.]}
)

In [15]:
X


Out[15]:
    age    names  surnames
0  14.0    Alice   Carroll
1  30.0      Bob   Meister
2  55.0  Charles    Darwin
3   7.0     Dora  Explorer
4  25.0      Eve     Wally

In [16]:
# use a string as key
item_selector = ItemSelector('names')
item_selector.fit_transform(X)


Out[16]:
0      Alice
1        Bob
2    Charles
3       Dora
4        Eve
Name: names, dtype: object

In [17]:
# use list of strings as keys
item_selector = ItemSelector(['names', 'age'])
item_selector.fit_transform(X)


Out[17]:
     names   age
0    Alice  14.0
1      Bob  30.0
2  Charles  55.0
3     Dora   7.0
4      Eve  25.0

Using functions to determine columns of the DataFrame

Sometimes you don't know the column names beforehand. In that case, you can supply a function to ItemSelector, which will be evaluated on each column name. Every column for which the function returns a truthy value is selected.


In [18]:
# only return columns that end with 'names'
def func(s):
    return s.endswith('names')

item_selector = ItemSelector(func)
item_selector.fit_transform(X)


Out[18]:
     names  surnames
0    Alice   Carroll
1      Bob   Meister
2  Charles    Darwin
3     Dora  Explorer
4      Eve     Wally

In [19]:
# use in combination with regular expressions
pattern = re.compile(r'n*a')
item_selector = ItemSelector(pattern.match)
item_selector.fit_transform(X)


Out[19]:
    age    names
0  14.0    Alice
1  30.0      Bob
2  55.0  Charles
3   7.0     Dora
4  25.0      Eve

Forcing a 2d shape

sklearn transformers often require a 2d array as input. In that case, use the force_2d=True argument.

This would raise a warning:

pipeline = Pipeline([
    ('selector', ItemSelector('age')),
    ('scaler', StandardScaler()),
])
pipeline.fit_transform(X)

This works:


In [20]:
pipeline = Pipeline([
    ('selector', ItemSelector('age', force_2d=True)),
    ('scaler', StandardScaler()),
])
pipeline.fit_transform(X)


Out[20]:
array([[-0.73897334],
       [ 0.23017202],
       [ 1.74446165],
       [-1.16297444],
       [-0.0726859 ]])

Encoding

XLabelEncoder

sklearn's LabelEncoder is intended for use with target data. Sometimes, however, we would like to encode feature data. The problem is that LabelEncoder raises an error when new samples are encountered. XLabelEncoder instead encodes new samples to the value 0. Furthermore, the encoded data has shape n x 1, so that it can later be used as feature data, e.g. in a FeatureUnion (see the sketch at the end of this section).


In [21]:
X = np.array(['a', 'b', 'c', 'a', 'c'])

In [22]:
encoder = XLabelEncoder().fit(X)

When all labels are known, XLabelEncoder maps them to the values 1..n. It returns a 2d array.


In [23]:
encoder.transform(X)


Out[23]:
array([[1],
       [2],
       [3],
       [1],
       [3]])

When new labels are encountered, they are mapped to 0.


In [24]:
encoder.transform(np.array(['a', 'b', 'c', 'd', 'e', 'a']))


Out[24]:
array([[1],
       [2],
       [3],
       [0],
       [0],
       [1]])
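
Since the encoded data is 2d, it can be concatenated with other features in a FeatureUnion. The following is a minimal sketch, not part of the original examples: it reuses the names/age DataFrame from the ItemSelector section and assumes that XLabelEncoder accepts the 1d output of ItemSelector:

df = pd.DataFrame({
    'names': ['Alice', 'Bob', 'Charles', 'Dora', 'Eve'],
    'age': [14., 30., 55., 7., 25.],
})

union = FeatureUnion([
    ('name', Pipeline([
        ('select', ItemSelector('names')),
        ('encode', XLabelEncoder()),  # n x 1 encoded labels
    ])),
    ('age', Pipeline([
        ('select', ItemSelector('age', force_2d=True)),
        ('scale', StandardScaler()),
    ])),
])

union.fit_transform(df)  # shape (5, 2): encoded name, scaled age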

Preprocessing

ParallelFunctionTransformer

The ParallelFunctionTransformer, as its name suggests, transforms data in a parallelized manner. The data is partitioned into n_jobs equally sized parts and then transformed in parallel.

As parallelization induces overhead, use this only when the map function is slow. Furthermore, some functions don't lend themselves to parallelization, as shown below.


In [25]:
X = np.arange(10).reshape(-1, 1)

Remember: We cannot use a lambda function, since it cannot be pickled.


In [26]:
def plus_one_func(X):
    return X + 1

In [27]:
transformer = ParallelFunctionTransformer(func=plus_one_func, n_jobs=2)

In [28]:
transformer.fit_transform(X)


Out[28]:
array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

Caution

Functions such as 'adding the standard deviation' or 'dividing by the max value' will not work for n_jobs > 1, because they require information about the whole data set.


In [29]:
def max_of(X):
    return np.ones_like(X) * np.max(X)

In [30]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=1)

In [31]:
transformer.fit_transform(X)


Out[31]:
array([[9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9]])

In [32]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=2)

In [33]:
transformer.fit_transform(X)


Out[33]:
array([[4],
       [4],
       [4],
       [4],
       [4],
       [9],
       [9],
       [9],
       [9],
       [9]])

Casting

ToDataFrame

This is a helper class that simplifies the common use case of converting data in a Pipeline to a pandas DataFrame. It handles several input types and, for some of them, lets you determine the column names.

numpy arrays


In [34]:
X = np.arange(5)
transformer = ToDataFrame()
transformer.fit_transform(X)


Out[34]:
   0
0  0
1  1
2  2
3  3
4  4

In [35]:
X = np.eye(5)
transformer.fit_transform(X)


Out[35]:
     0    1    2    3    4
0  1.0  0.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0  0.0
2  0.0  0.0  1.0  0.0  0.0
3  0.0  0.0  0.0  1.0  0.0
4  0.0  0.0  0.0  0.0  1.0

Pass a list of strings to the columns argument to determine the DataFrame's column names.


In [36]:
transformer = ToDataFrame(columns=['col_%d' % i for i in range(5)])
transformer.fit_transform(X)


Out[36]:
   col_0  col_1  col_2  col_3  col_4
0    1.0    0.0    0.0    0.0    0.0
1    0.0    1.0    0.0    0.0    0.0
2    0.0    0.0    1.0    0.0    0.0
3    0.0    0.0    0.0    1.0    0.0
4    0.0    0.0    0.0    0.0    1.0

pandas Series


In [37]:
X = pd.Series(data=np.arange(5), name='col')
transformer = ToDataFrame()
transformer.fit_transform(X)


Out[37]:
   col
0    0
1    1
2    2
3    3
4    4

For a Series, it is possible to give it another name by passing a string to the columns argument.


In [38]:
X = pd.Series(data=np.arange(5), name='col')
transformer = ToDataFrame(columns='other-name')
transformer.fit_transform(X)


Out[38]:
   other-name
0           0
1           1
2           2
3           3
4           4

dicts

When a dict is passed, the keys determine the column names. This can be especially useful in conjunction with dstoolbox.pipeline.DictFeatureUnion, as sketched after the example below.


In [39]:
X = {'col0': np.arange(5), 'col1': np.linspace(0, 1, 5)}
transformer = ToDataFrame()
transformer.fit_transform(X)


Out[39]:
   col0  col1
0     0  0.00
1     1  0.25
2     2  0.50
3     3  0.75
4     4  1.00
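
Here is a sketch of that combination. It assumes only what the name suggests, namely that DictFeatureUnion works like sklearn's FeatureUnion but returns a dict keyed by step name; those keys then become the column names:

from dstoolbox.pipeline import DictFeatureUnion

pipeline = Pipeline([
    ('union', DictFeatureUnion([
        ('names', ItemSelector('names')),
        ('age', ItemSelector('age')),
    ])),
    ('to_df', ToDataFrame()),  # dict keys 'names' and 'age' become columns
])

Applied to the names/age DataFrame from the ItemSelector section, this would yield a DataFrame with the columns 'names' and 'age'.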

lists

ToDataFrame also works when a simple list of data needs to be transformed to a DataFrame. Again, you may or may not set the columns argument.


In [40]:
X = [5, 4, 3, 2, 1]
transformer = ToDataFrame(columns=['my-col'])
transformer.fit_transform(X)


Out[40]:
   my-col
0       5
1       4
2       3
3       2
4       1

Padding

Padder2d

Sometimes, we have heterogeneous arrays and would like to homogenize them. E.g., think of encoding words in sentences as indices. Since sentences typically have different numbers of words, the output could look something like this:


In [41]:
X = [
    [],
    [0, 1, 2],
    [10, 11, 12, 13, 14, 15],
    [100, 101, 102, 103],
]

With Padder2d, we can make the data homogeneous. For this, we have to determine the max length that a sentence is allowed to have, in this case 4:


In [42]:
Padder2d(max_len=4).fit_transform(X)


Out[42]:
array([[   0.,    0.,    0.,    0.],
       [   0.,    1.,    2.,    0.],
       [  10.,   11.,   12.,   13.],
       [ 100.,  101.,  102.,  103.]], dtype=float32)

Other pad value

We have padded short sequences with 0, but that may not be what we want. Here we pad with -1 instead:


In [43]:
Padder2d(max_len=4, pad_value=-1).fit_transform(X)


Out[43]:
array([[  -1.,   -1.,   -1.,   -1.],
       [   0.,    1.,    2.,   -1.],
       [  10.,   11.,   12.,   13.],
       [ 100.,  101.,  102.,  103.]], dtype=float32)

Other dtype

By default, Padder2d returns a np.float32 array. We may also indicate other dtypes:


In [44]:
Padder2d(max_len=4, dtype=np.int64).fit_transform(X)


Out[44]:
array([[  0,   0,   0,   0],
       [  0,   1,   2,   0],
       [ 10,  11,  12,  13],
       [100, 101, 102, 103]])

Padder3d

Sometimes, we have heterogeneous 3d arrays and would like to homogenize them. E.g., think again of encoding sentences as indices, but instead of encoding each word as an int, we encode each character in each word as an int. The result is a heterogeneous 3d array.
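
For instance, such nested data could be produced by a small helper like this hypothetical one (plain Python, not part of dstoolbox):

def sentences_to_char_ids(sentences):
    # one list per sentence, one sub-list per word, one int per character
    return [[[ord(char) for char in word] for word in sentence.split()]
            for sentence in sentences]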


In [45]:
X = [
    [],
    [[0, 0], [1, 1, 1], [2]],
    [[10], [], [12, 12, 12, 12, 12], [13], [14, 14], [15]],
    [[100], [101, 102, 103, 104], [102], [103, 104, 105]],
]

Our machine learning algorithm may require a homogeneous 3d array, though. This is where Padder3d comes into play. Say we want to have up to 4 words per sentence and up to 3 characters per word. This is how to achieve that:


In [46]:
Padder3d(max_size=(4, 3)).fit_transform(X)


Out[46]:
array([[[   0.,    0.,    0.],
        [   0.,    0.,    0.],
        [   0.,    0.,    0.],
        [   0.,    0.,    0.]],

       [[   0.,    0.,    0.],
        [   1.,    1.,    1.],
        [   2.,    0.,    0.],
        [   0.,    0.,    0.]],

       [[  10.,    0.,    0.],
        [   0.,    0.,    0.],
        [  12.,   12.,   12.],
        [  13.,    0.,    0.]],

       [[ 100.,    0.,    0.],
        [ 101.,  102.,  103.],
        [ 102.,    0.,    0.],
        [ 103.,  104.,  105.]]], dtype=float32)

Other pad value

We have padded short sequences with 0, but that may not be what we want. Here we pad with -1 instead:


In [47]:
Padder3d(max_size=(4, 3), pad_value=-1).fit_transform(X)


Out[47]:
array([[[  -1.,   -1.,   -1.],
        [  -1.,   -1.,   -1.],
        [  -1.,   -1.,   -1.],
        [  -1.,   -1.,   -1.]],

       [[   0.,    0.,   -1.],
        [   1.,    1.,    1.],
        [   2.,   -1.,   -1.],
        [  -1.,   -1.,   -1.]],

       [[  10.,   -1.,   -1.],
        [  -1.,   -1.,   -1.],
        [  12.,   12.,   12.],
        [  13.,   -1.,   -1.]],

       [[ 100.,   -1.,   -1.],
        [ 101.,  102.,  103.],
        [ 102.,   -1.,   -1.],
        [ 103.,  104.,  105.]]], dtype=float32)

Other dtype

By default, Padder3d returns a np.float32 array. We may also indicate other dtypes:


In [48]:
Padder3d(max_size=(4, 3), dtype=np.int64).fit_transform(X)


Out[48]:
array([[[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  1,   1,   1],
        [  2,   0,   0],
        [  0,   0,   0]],

       [[ 10,   0,   0],
        [  0,   0,   0],
        [ 12,  12,  12],
        [ 13,   0,   0]],

       [[100,   0,   0],
        [101, 102, 103],
        [102,   0,   0],
        [103, 104, 105]]])

Text

TextFeaturizer

The TextFeaturizer transforms a list or array of strings to an array of lists of ints, where each int represents a word.


In [49]:
text = [
    'To be, or not to be--that is the question:',
    "Whether 'tis nobler in the mind to suffer",
    'The slings and arrows of outrageous fortune',
    'Or to take arms against a sea of troubles',
    'And by opposing end them. To die, to sleep--',
    'No more--and by a sleep to say we end',
    'The heartache, and the thousand natural shocks',
    "That flesh is heir to. 'Tis a consummation",
    'Devoutly to be wished. To die, to sleep--',
    "To sleep--perchance to dream: ay, there's the rub,",
    'For in that sleep of death what dreams may come',
    'When we have shuffled off this mortal coil,',
    "Must give us pause. There's the respect",
    'That makes calamity of so long life.',
]

In [50]:
featurizer = TextFeaturizer()

In [51]:
featurizer.fit_transform(text)


Out[51]:
array([list([64, 5, 41, 37, 64, 5, 57, 25, 58, 45]),
       list([70, 63, 36, 24, 58, 30, 64, 55]),
       list([58, 53, 1, 3, 38, 42, 19]),
       list([41, 64, 56, 2, 0, 49, 38, 65]),
       list([1, 6, 40, 16, 59, 64, 13, 64, 52]),
       list([35, 31, 1, 6, 52, 64, 48, 67, 16]),
       list([58, 22, 1, 58, 62, 34, 50]),
       list([57, 17, 25, 23, 64, 63, 10]),
       list([12, 64, 5, 71, 64, 13, 64, 52]),
       list([64, 52, 44, 64, 14, 4, 60, 58, 47]),
       list([18, 24, 57, 52, 38, 11, 68, 15, 29, 9]),
       list([69, 67, 21, 51, 39, 61, 32, 8]),
       list([33, 20, 66, 43, 60, 58, 46]),
       list([57, 28, 7, 38, 54, 27, 26])], dtype=object)

Parameters

TextFeaturizer supports the same parameters as sklearn's CountVectorizer and TfidfVectorizer. These include

  • setting max_features,
  • using n-grams,
  • choosing between word and character level,

to name but a few.


In [52]:
# limit vocabulary to 10 words
featurizer = TextFeaturizer(max_features=10)
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())


[list([9, 1, 9, 1, 6, 7]) list([7, 9]) list([7, 0, 4]) list([9, 4])
 list([0, 3, 9, 2, 9, 5])]
['and', 'be', 'die', 'end', 'of', 'sleep', 'that', 'the', 'there', 'to']

In [53]:
# use word bigrams
featurizer = TextFeaturizer(ngram_range=(2, 2), max_features=10)
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())


[list([7, 7]) list([]) list([]) list([0]) list([1, 8, 2, 9])]
['against sea', 'and by', 'die to', 'that makes', 'that sleep', 'the heartache', 'there the', 'to be', 'to die', 'to sleep']

In [54]:
# use character unigrams
featurizer = TextFeaturizer(analyzer='char')
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())


[ list([24, 19, 0, 7, 10, 2, 0, 19, 22, 0, 18, 19, 24, 0, 24, 19, 0, 7, 10, 3, 3, 24, 13, 6, 24, 0, 14, 23, 0, 24, 13, 10, 0, 21, 25, 10, 23, 24, 14, 19, 18, 5])
 list([27, 13, 10, 24, 13, 10, 22, 0, 1, 24, 14, 23, 0, 18, 19, 7, 16, 10, 22, 0, 14, 18, 0, 24, 13, 10, 0, 17, 14, 18, 9, 0, 24, 19, 0, 23, 25, 11, 11, 10, 22])
 list([24, 13, 10, 0, 23, 16, 14, 18, 12, 23, 0, 6, 18, 9, 0, 6, 22, 22, 19, 27, 23, 0, 19, 11, 0, 19, 25, 24, 22, 6, 12, 10, 19, 25, 23, 0, 11, 19, 22, 24, 25, 18, 10])
 list([19, 22, 0, 24, 19, 0, 24, 6, 15, 10, 0, 6, 22, 17, 23, 0, 6, 12, 6, 14, 18, 23, 24, 0, 6, 0, 23, 10, 6, 0, 19, 11, 0, 24, 22, 19, 25, 7, 16, 10, 23])
 list([6, 18, 9, 0, 7, 28, 0, 19, 20, 20, 19, 23, 14, 18, 12, 0, 10, 18, 9, 0, 24, 13, 10, 17, 4, 0, 24, 19, 0, 9, 14, 10, 2, 0, 24, 19, 0, 23, 16, 10, 10, 20, 3, 3])]
[' ', "'", ',', '-', '.', ':', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'y']

Unknown words

In addition, dstoolbox provides the unknown_token parameter. If this is set to None -- the default -- words that are out-of-vocabulary are simply dropped. Set this to a string and those words are instead replaced by a special index for all out-of-vocabulary words.


In [55]:
# each word that does not appear at least twice is replaced by '<UNK>'
featurizer = TextFeaturizer(unknown_token='<UNK>', min_df=2)
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())


[list([14, 1, 8, 16, 14, 1, 10, 6, 11, 16])
 list([16, 13, 16, 5, 11, 16, 14, 16]) list([11, 16, 0, 16, 7, 16, 16])
 list([8, 14, 16, 16, 16, 16, 7, 16]) list([0, 2, 16, 4, 16, 14, 3, 14, 9])]
['and', 'be', 'by', 'die', 'end', 'in', 'is', 'of', 'or', 'sleep', 'that', 'the', 'there', 'tis', 'to', 'we', '<UNK>']

Note: If you set the parameter max_features at the same time, the unknown_token is not considered when limiting the vocabulary size, effectively increasing the vocabulary size to max_features + 1.
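
A quick way to check this, using only calls already shown above:

# with max_features=3 and an unknown token, the learned vocabulary
# should hold 3 regular words plus '<UNK>', i.e. 4 entries in total
featurizer = TextFeaturizer(max_features=3, unknown_token='<UNK>')
featurizer.fit(text)
print(featurizer.get_feature_names())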

Combining with Padder2d

TextFeaturizer can be combined with Padder2d to create a homogeneous array of ints, making this combination well suited for transforming text features so that they can be fed to a neural network that embeds these words.

If we do this, it is best to limit the vocabulary size by setting max_features to a value N, and to set unknown_token so that out-of-vocabulary words are not just dropped. Furthermore, you should set the pad_value of Padder2d to N+1. This is because the indices of the regular words range from 0 to N-1 and the index of the unknown token is N. Thus the value of the padded entries, N+1, will be unique, as shown below:


In [56]:
max_features = 10

Pipeline([
    ('featurize', TextFeaturizer(
        max_features=max_features,
        unknown_token='<UNK>',
    )),
    ('padder', Padder2d(
        max_len=15,
        pad_value=max_features + 1,
        dtype=int,
    )),
]).fit_transform(text)[:3]


Out[56]:
array([[ 9,  1, 10, 10,  9,  1,  6, 10,  7, 10, 11, 11, 11, 11, 11],
       [10, 10, 10, 10,  7, 10,  9, 10, 11, 11, 11, 11, 11, 11, 11],
       [ 7, 10,  0, 10,  4, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11]])
