In [1]:
import re
In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
In [3]:
from dstoolbox.transformers import ItemSelector
from dstoolbox.transformers import XLabelEncoder
from dstoolbox.transformers import ParallelFunctionTransformer
from dstoolbox.transformers import ToDataFrame
from dstoolbox.transformers import Padder2d
from dstoolbox.transformers import Padder3d
from dstoolbox.transformers import TextFeaturizer
ItemSelector selects a column or a slice along axis=1 from a numpy array.
In [4]:
X = np.eye(5)
In [5]:
X
Out[5]:
In [6]:
# pass a slice object
pipeline = Pipeline([
    ('selector', ItemSelector(slice(1, 2))),
    ('scaler', StandardScaler()),
])
In [7]:
pipeline.fit_transform(X)
Out[7]:
In [8]:
# or a list
pipeline = Pipeline([
    ('selector', ItemSelector([1])),
    ('scaler', StandardScaler()),
])
In [9]:
pipeline.fit_transform(X)
Out[9]:
In [10]:
pipeline = Pipeline([
    ('selector', ItemSelector([1, 3])),
    ('scaler', StandardScaler()),
])
In [11]:
pipeline.fit_transform(X)
Out[11]:
In [12]:
pipeline = Pipeline([
    ('selector', ItemSelector(np.s_[2:6:2])),
    ('scaler', StandardScaler()),
])
In [13]:
pipeline.fit_transform(X)
Out[13]:
In [14]:
X = pd.DataFrame(data={
    'names': ['Alice', 'Bob', 'Charles', 'Dora', 'Eve'],
    'surnames': ['Carroll', 'Meister', 'Darwin', 'Explorer', 'Wally'],
    'age': [14., 30., 55., 7., 25.],
})
In [15]:
X
Out[15]:
In [16]:
# use a string as key
item_selector = ItemSelector('names')
item_selector.fit_transform(X)
Out[16]:
In [17]:
# use list of strings as keys
item_selector = ItemSelector(['names', 'age'])
item_selector.fit_transform(X)
Out[17]:
Sometimes you don't know the column names beforehand. Then you can supply a function to ItemSelector, which will be evaluated on each column name. If the column matches (i.e., the result of applying the function to the column name is true), it will be returned.
In [18]:
# only return columns that end with 'names'
def func(s):
    return s.endswith('names')

item_selector = ItemSelector(func)
item_selector.fit_transform(X)
Out[18]:
In [19]:
# use in combination with regular expressions
pattern = re.compile(r'n*a')
item_selector = ItemSelector(pattern.match)
item_selector.fit_transform(X)
Out[19]:
sklearn transformers often require a 2d array as input. In that case, use the force_2d=True argument.
This would raise a warning:
pipeline = Pipeline([
    ('selector', ItemSelector('age')),
    ('scaler', StandardScaler()),
])
pipeline.fit_transform(X)
This works:
In [20]:
pipeline = Pipeline([
    ('selector', ItemSelector('age', force_2d=True)),
    ('scaler', StandardScaler()),
])
pipeline.fit_transform(X)
Out[20]:
sklearn's LabelEncoder is intended for use in conjunction with target data. However, sometimes we would like to encode feature data. The problem is that the LabelEncoder will raise an error when new samples are encountered. The XLabelEncoder will instead encode new samples to the value 0. Furthermore, the encoded data will have shape n x 1, so that it can later be used as feature data (e.g. in a FeatureUnion).
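Because the output is 2d, XLabelEncoder can be dropped directly into a FeatureUnion. The following is a minimal sketch, not part of the original notebook; it assumes XLabelEncoder accepts the 1d column returned by ItemSelector and reuses the names/age DataFrame from above:
union = FeatureUnion([
    ('name_code', Pipeline([
        ('select', ItemSelector('names')),
        ('encode', XLabelEncoder()),
    ])),
    ('age', Pipeline([
        ('select', ItemSelector('age', force_2d=True)),
        ('scale', StandardScaler()),
    ])),
])
# union.fit_transform(X) would stack the encoded names next to the scaled ages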
In [21]:
X = np.array(['a', 'b', 'c', 'a', 'c'])
In [22]:
encoder = XLabelEncoder().fit(X)
In [23]:
encoder.transform(X)
Out[23]:
In [24]:
encoder.transform(np.array(['a', 'b', 'c', 'd', 'e', 'a']))
Out[24]:
The ParallelFunctionTransformer, as its name suggests, transforms data in a parallelized manner. The data will be partitioned into n_jobs equally sized parts and then be transformed in parallel.
As parallelization induces overhead, use this only when the map function is slow. Furthermore, some functions don't lend themselves to parallelization, as shown below.
In [25]:
X = np.arange(10).reshape(-1, 1)
Remember: we cannot use a lambda function, since it cannot be pickled.
In [26]:
def plus_one_func(X):
    return X + 1
In [27]:
transformer = ParallelFunctionTransformer(func=plus_one_func, n_jobs=2)
In [28]:
transformer.fit_transform(X)
Out[28]:
Functions such as 'adding the standard deviation' or 'dividing by the max value' will not work for n_jobs > 1, because they require information about the whole data; each parallel worker only sees its own partition.
In [29]:
def max_of(X):
    return np.ones_like(X) * np.max(X)
In [30]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=1)
In [31]:
transformer.fit_transform(X)
Out[31]:
In [32]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=2)
In [33]:
transformer.fit_transform(X)
Out[33]:
ToDataFrame is a helper class that simplifies the common use case of converting data in a Pipeline to a pandas DataFrame. It deals with a couple of input types and lets you determine the column names for some of them.
In [34]:
X = np.arange(5)
transformer = ToDataFrame()
transformer.fit_transform(X)
Out[34]:
In [35]:
X = np.eye(5)
transformer.fit_transform(X)
Out[35]:
Pass a list of strings to the columns argument to determine the DataFrame's column names.
In [36]:
transformer = ToDataFrame(columns=['col_%d' % i for i in range(5)])
transformer.fit_transform(X)
Out[36]:
In [37]:
X = pd.Series(data=np.arange(5), name='col')
transformer = ToDataFrame()
transformer.fit_transform(X)
Out[37]:
For a Series, it is possible to give another name by passing a string to the columns argument.
In [38]:
X = pd.Series(data=np.arange(5), name='col')
transformer = ToDataFrame(columns='other-name')
transformer.fit_transform(X)
Out[38]:
When a dict is passed, the keys determine the column names. This can be especially useful in conjunction with dstoolbox.pipeline.DictFeatureUnion.
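As a rough sketch, not part of the original notebook and assuming that DictFeatureUnion takes a list of (name, transformer) steps like FeatureUnion and returns a dict keyed by step name, the two could be chained so that the union's step names become the DataFrame columns:
from dstoolbox.pipeline import DictFeatureUnion

pipe = Pipeline([
    ('union', DictFeatureUnion([
        ('name_code', Pipeline([
            ('select', ItemSelector('names')),
            ('encode', XLabelEncoder()),
        ])),
        ('age', ItemSelector('age', force_2d=True)),
    ])),
    ('to_df', ToDataFrame()),
])
# applied to a DataFrame with 'names' and 'age' columns (like the one further above),
# pipe.fit_transform(...) should yield a DataFrame with columns 'name_code' and 'age'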
In [39]:
X = {'col0': np.arange(5), 'col1': np.linspace(0, 1, 5)}
transformer = ToDataFrame()
transformer.fit_transform(X)
Out[39]:
ToDataFrame also works when a simple list of data needs to be transformed to a DataFrame. Again, you may or may not set the columns argument.
In [40]:
X = [5, 4, 3, 2, 1]
transformer = ToDataFrame(columns=['my-col'])
transformer.fit_transform(X)
Out[40]:
Sometimes, we have heterogeneous arrays and would like to homogenize them. E.g., think of encoding the words in sentences as indices. Since sentences typically have different numbers of words, the output could look something like this:
In [41]:
X = [
    [],
    [0, 1, 2],
    [10, 11, 12, 13, 14, 15],
    [100, 101, 102, 103],
]
With Padder2d, we can make the data homogeneous. For this, we have to determine the maximum length that a sentence is allowed to have, in this case 4:
In [42]:
Padder2d(max_len=4).fit_transform(X)
Out[42]:
We have padded short sequences with 0, but that may not be what we want. Here we will pad with -1 instead:
In [43]:
Padder2d(max_len=4, pad_value=-1).fit_transform(X)
Out[43]:
By default, Padder2d returns a np.float32 array. We may also indicate other dtypes:
In [44]:
Padder2d(max_len=4, dtype=np.int64).fit_transform(X)
Out[44]:
Sometimes, we have heterogeneous 3d arrays and would like to homogenize them. E.g., think of encoding sentences as indices. But instead of encoding each word as an int, we want to encode each character in each word as an int. The result would be a heterogeneous 3d array.
In [45]:
X = [
    [],
    [[0, 0], [1, 1, 1], [2]],
    [[10], [], [12, 12, 12, 12, 12], [13], [14, 14], [15]],
    [[100], [101, 102, 103, 104], [102], [103, 104, 105]],
]
Our machine learning algorithm may require a homogeneous 3d array, though. This is where Padder3d comes into play. Say we want to have up to 4 characters per word and up to 3 words per sentence. This is how to achieve that:
In [46]:
Padder3d(max_size=(4, 3)).fit_transform(X)
Out[46]:
We have padded short sequences with 0, but that may not be what we want. Here we will pad with -1 instead:
In [47]:
Padder3d(max_size=(4, 3), pad_value=-1).fit_transform(X)
Out[47]:
By default, Padder3d returns a np.float32 array. We may also indicate other dtypes:
In [48]:
Padder3d(max_size=(4, 3), dtype=np.int64).fit_transform(X)
Out[48]:
The TextFeaturizer transforms a list or array of strings to an array of lists of ints, where each int represents a word.
In [49]:
text = [
    'To be, or not to be--that is the question:',
    "Whether 'tis nobler in the mind to suffer",
    'The slings and arrows of outrageous fortune',
    'Or to take arms against a sea of troubles',
    'And by opposing end them. To die, to sleep--',
    'No more--and by a sleep to say we end',
    'The heartache, and the thousand natural shocks',
    "That flesh is heir to. 'Tis a consummation",
    'Devoutly to be wished. To die, to sleep--',
    "To sleep--perchance to dream: ay, there's the rub,",
    'For in that sleep of death what dreams may come',
    'When we have shuffled off this mortal coil,',
    "Must give us pause. There's the respect",
    'That makes calamity of so long life.',
]
In [50]:
featurizer = TextFeaturizer()
In [51]:
featurizer.fit_transform(text)
Out[51]:
TextFeaturizer supports the same parameters as sklearn's CountVectorizer and TfidfVectorizer. These include max_features, ngram_range, analyzer, and min_df, to name but a few.
In [52]:
# limit vocabulary to 10 words
featurizer = TextFeaturizer(max_features=10)
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())
In [53]:
# use word bigrams
featurizer = TextFeaturizer(ngram_range=(2, 2), max_features=10)
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())
In [54]:
# use character unigrams
featurizer = TextFeaturizer(analyzer='char')
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())
In addition, dstoolbox provides the unknown_token parameter. If this is set to None (the default), words that are out-of-vocabulary are simply dropped. Set it to a string and those words are instead replaced by a special index for all out-of-vocabulary words.
In [55]:
# each word that does not appear at least twice is replaced by '<UNK>'
featurizer = TextFeaturizer(unknown_token='<UNK>', min_df=2)
print(featurizer.fit_transform(text)[:5])
print(featurizer.get_feature_names())
Note: If you set the parameter max_features at the same time, the unknown_token is not considered when limiting the vocabulary size, effectively increasing the vocabulary size to max_features + 1.
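As a quick check (not part of the original run), the fitted vocabulary should then contain max_features regular words plus the unknown token:
featurizer = TextFeaturizer(max_features=10, unknown_token='<UNK>')
featurizer.fit(text)
# expected: 11 entries, i.e. 10 regular words plus '<UNK>'
len(featurizer.get_feature_names())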
TextFeaturizer can be combined with Padder2d to create a homogeneous array of ints, making this combination the perfect pipeline if you want to transform text features so that they can be fed to a neural network that embeds these words.
If we do this, it is best to limit the vocabulary size by setting max_features to a value N, and to set unknown_token so that out-of-vocabulary words are not just dropped. Furthermore, you should set the pad_value of Padder2d to N+1. This is because the indices of the regular words range from 0 to N-1 and the index of the unknown token is N. Thus the value of the padded entries, N+1, will be unique, as shown below:
In [56]:
max_features = 10
Pipeline([
    ('featurize', TextFeaturizer(
        max_features=max_features,
        unknown_token='<UNK>',
    )),
    ('padder', Padder2d(
        max_len=15,
        pad_value=max_features + 1,
        dtype=int,
    )),
]).fit_transform(text)[:3]
Out[56]:
In [ ]: