If you are reading this, that means you are still alive. Welcome back to the reality of learning scikit-learn.
This tutorial focuses on feature engineering and covers more advanced topics in scikit-learn: feature extraction, building a pipeline, creating custom transformers, feature union, dimensionality reduction, and grid search. Feature engineering is a very important step in NLP and ML; selecting good features is not a trivial task. Therefore, we are spending a lot of time on it here.
Without further ado, let's start with loading a dataset again. This time we will use a CSV file that has more than two columns, i.e. one column for labels and multiple columns for raw data/features: a subset of the Yelp Review Data Challenge dataset. Just like the 20 News Group dataset, I converted this dataset to CSV.
There are 5 star ratings (shocking), and I extracted 500 reviews for each rating. This dataset is small because it is only intended for a quick demo; therefore, the performance of any classifier won't be too good (and this should really be a regression problem instead of classification, but to keep things simple, let's stick with classification). Besides extracting the text of each review, I also included other users' votes for each review, i.e. funny, useful, and cool. For this tutorial, our task is to predict the star rating of each review.
In [ ]:
import pandas as pd
dataset = pd.read_csv('yelp-review-subset.csv', header=0, delimiter=',', names=['stars', 'text', 'funny', 'useful', 'cool'])
# just checking the dataset
print('There are {0} star ratings, and {1} reviews'.format(len(dataset.stars.unique()), len(dataset)))
print(dataset.stars.value_counts())
With a pandas DataFrame, it is very easy to select a subset of columns: simply pass in a list of column names. So we are going to split our data just like in the previous tutorial.
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset[['text', 'funny', 'useful', 'cool']], dataset['stars'], train_size=0.8)
The difference now is that we have four columns in our raw data. Three of them, funny, useful, and cool, already contain numeric values, which is perfect for scikit-learn since it expects feature values to be numeric (or indices). What we need to do is extract features from the text column.
In [ ]:
print(X_train.columns)
We can first do something very similar to the previous tutorial: initialize a CountVectorizer object and then pass the raw text data into its fit_transform() function to index the count of each word. Please note that you should not pass the whole X_train dataframe into the function, but only the text column, i.e. the X_train.text series (more or less like an array). Otherwise, the vectorizer does not extract features from your review texts at all (iterating over the whole dataframe only yields the column names). Compare the shapes of the two outputs below.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize a CountVectorizer
cv = CountVectorizer()
# fit the vectorizer on the raw text and transform it into a sparse document-term matrix
X_train_counts = cv.fit_transform(X_train.text)
print(X_train_counts.shape)
# this is not what you want: the vectorizer never sees the review texts
cv_test = CountVectorizer()
X_train_counts_test = cv_test.fit_transform(X_train)
print(X_train_counts_test.shape)
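If you are curious what these bag-of-words features actually are, you can peek at the vocabulary the vectorizer learned from the text column. This is just a quick optional check; get_feature_names_out() is available in recent scikit-learn versions, while older versions use get_feature_names().
In [ ]:
# number of unique tokens indexed from the text column
print(len(cv.vocabulary_))
# a few of the learned feature names (use get_feature_names() on older scikit-learn)
print(cv.get_feature_names_out()[:10])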
Now we have a problem: the X_train_counts matrix only contains features from text. What should we do if we want to include the funny, useful, and cool vote counts as features as well?
Pipeline and FeatureUnion
To deal with that problem, we need to talk about Pipeline and FeatureUnion. Pipeline lets us define a list of steps consisting of transformers that extract features from data (including a FeatureUnion) and a final estimator (aka classifier). FeatureUnion basically concatenates the results of multiple transformer objects. The following is a complete example of how to use the two together.
In [ ]:
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
class ItemSelector(TransformerMixin, BaseEstimator):
    """This class allows you to select a subset of a dataframe based on given column name(s)."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, dataframe):
        return dataframe[self.keys]


class VotesToDictTransformer(TransformerMixin, BaseEstimator):
    """This transformer converts the vote counts of each row into a dictionary."""

    def fit(self, x, y=None):
        return self

    def transform(self, votes):
        funny, useful, cool = votes['funny'], votes['useful'], votes['cool']
        return [{'funny': binarize_number(f, 1), 'useful': binarize_number(u, 1), 'cool': binarize_number(c, 1)}
                for f, u, c in zip(funny, useful, cool)]


def binarize_number(num, threshold):
    """Return 0 if num is below the threshold, 1 otherwise."""
    return 0 if num < threshold else 1
pipeline = Pipeline([
    # Use FeatureUnion to combine the features from text and votes
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for getting BOW features from the texts
            ('bag-of-words', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', CountVectorizer()),
            ])),
            # Pipeline for getting vote counts as features;
            # the DictVectorizer object indexes the values of the dictionaries
            # passed down from the VotesToDictTransformer object.
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'bag-of-words': 1.0,
            'votes': 0.5,
        },
    )),
    # Use a logistic regression classifier on the combined features
    ('clf', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(classification_report(predicted, y_test))
In the previous section, I defined two classes, ItemSelector and VotesToDictTransformer; what they have in common is that they inherit from TransformerMixin. TransformerMixin is the base class of many built-in transformers and vectorizers in scikit-learn, e.g. CountVectorizer, TfidfVectorizer, TfidfTransformer, DictVectorizer, etc. We define the transform() method to manipulate the data in a custom way. For example, ItemSelector returns a subset of a dataframe based on given column names, and VotesToDictTransformer transforms a dataframe into a list of dictionaries.
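If you want to see what these custom transformers do in isolation, you can call them directly on the training data, outside of any pipeline. This is only a quick sketch using the classes defined above; fit_transform() comes for free from TransformerMixin.
In [ ]:
# select the three vote columns, then turn each row into a dictionary
votes = ItemSelector(keys=['funny', 'useful', 'cool']).fit_transform(X_train)
vote_dicts = VotesToDictTransformer().fit_transform(votes)
print(vote_dicts[:3])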
To demonstrate how useful custom transformers are, let's define another one. Say we hypothesize that the sentiment of each review can be a strong feature for predicting the star rating. Then we would need a SentimentTransformer class.
To avoid spending time training our own sentiment classifier, we use the TextBlob package for its built-in sentiment analysis feature.
In [ ]:
from textblob import TextBlob
class SentimentTransformer(TransformerMixin, BaseEstimator):
    """This transformer converts each text into a dictionary of binarized sentiment scores."""

    def fit(self, x, y=None):
        return self

    def transform(self, texts):
        features = []
        for text in texts:
            # pandas already gives us str objects, so cast defensively instead of decoding bytes
            blob = TextBlob(str(text))
            features.append({'polarity': binarize_number(blob.sentiment.polarity, 0.5),
                             'subjectivity': binarize_number(blob.sentiment.subjectivity, 0.5)})
        return features
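Before wiring this transformer into the pipeline, it doesn't hurt to sanity-check TextBlob's sentiment scores on a single review. This is just an illustrative check: polarity ranges from -1 to 1 and subjectivity from 0 to 1, which is why the transformer above binarizes both at a threshold of 0.5.
In [ ]:
# polarity is in [-1, 1], subjectivity is in [0, 1]
sample = X_train.text.iloc[0]
print(TextBlob(str(sample)).sentiment)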
Let's add that transformer to our existing pipeline, and see if the additional features help.
In [ ]:
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('bag-of-words', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', CountVectorizer()),
            ])),
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
            ('sentiments', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('sentiment_transform', SentimentTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'bag-of-words': 1.0,
            'votes': 0.5,
            'sentiments': 1.0,
        },
    )),
    # Use a logistic regression classifier on the combined features
    ('clf', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(classification_report(predicted, y_test))
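Out of curiosity, we can also check how large the combined feature space is. This is a quick check on the pipeline fitted above; the vast majority of the columns come from the bag-of-words vectorizer.
In [ ]:
# shape of the concatenated feature matrix produced by the FeatureUnion
print(pipeline.named_steps['union'].transform(X_test).shape)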
Two major problems of using bag-of-words as features are (1) it introduces noise, and (2) it increases the dimensionality of the feature space. When using bag-of-words, we simply throw a bunch of words into the feature space and hope that they work, because we don't know in advance which words are most informative for the model. Other than handcrafting features and hand-picking which words to put into the feature space, we can also use the feature_selection module to automatically select informative features and eliminate noise.
"SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter." Basically, the idea is "to reduce the dimensionality of the data to use with another classifier, they can be used along with feature_selection.SelectFromModel to select the non-zero coefficients."
In the following example, we use LogisticRegression to perform feature elimination. "l2" is the penalty, and C controls the strength of the regularization: the smaller C is, the fewer features tend to be selected.
In [ ]:
from sklearn.feature_selection import SelectFromModel
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('bag-of-words', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', CountVectorizer()),
            ])),
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
            ('sentiments', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('sentiment_transform', SentimentTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'bag-of-words': 1.0,
            'votes': 0.5,
            'sentiments': 1.0,
        },
    )),
    # use SelectFromModel to select informative features
    ('feature_selection', SelectFromModel(LogisticRegression(C=0.5, penalty="l2"))),
    # Use a logistic regression classifier on the selected features
    ('clf', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(classification_report(predicted, y_test))
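To see how aggressively the selector pruned the feature space, we can ask the fitted SelectFromModel step how many features it kept. get_support() returns a boolean mask over the input features of that step.
In [ ]:
# boolean mask of the features that survived selection
mask = pipeline.named_steps['feature_selection'].get_support()
print('{0} features kept out of {1}'.format(mask.sum(), len(mask)))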
Finally, many classifiers like logistic regression or SVM have parameters to tweak in order to get optimal results, and it can be a pain in the neck to try every combination by hand. Grid search is an automated way to try every combination and rank them. In scikit-learn, we use model_selection.GridSearchCV. The bare minimum set of arguments for a grid search is an estimator object and a dictionary (or list of dictionaries) of parameters. In our case, we are passing a pipeline and a dictionary (Pipeline inherits from BaseEstimator).
Parameters of the estimators in a pipeline can be accessed using the <estimator>__<parameter> syntax. We want to tune the max_iter and C values of our LogisticRegression, whose name in the pipeline is clf. Therefore, the dictionary has two entries: clf__max_iter and clf__C.
In [ ]:
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('bag-of-words', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', CountVectorizer()),
            ])),
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
            ('sentiments', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('sentiment_transform', SentimentTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'bag-of-words': 1.0,
            'votes': 0.5,
            'sentiments': 1.0,
        },
    )),
    # Use a logistic regression classifier on the combined features
    ('clf', LogisticRegression()),
])
params = dict(clf__max_iter=[50, 100, 150], clf__C=[1.0, 0.5, 0.1])
grid_search = GridSearchCV(pipeline, param_grid=params)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
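GridSearchCV also records the best cross-validated score and, because refit=True by default, keeps a copy of the pipeline refit on the whole training set with the best parameters. So, as a shortcut, you can evaluate the winning combination directly; the next cell re-trains it explicitly to make the parameter values visible.
In [ ]:
# mean cross-validated score of the best parameter combination
print(grid_search.best_score_)
# best_estimator_ is the pipeline already refit with the best parameters
print(classification_report(grid_search.best_estimator_.predict(X_test), y_test))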
According to the output above, when C=0.1 and max_iter=50, we get the best results. To validate the results, let's use these values to train and test a model.
In [ ]:
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('bag-of-words', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', CountVectorizer()),
            ])),
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
            ('sentiments', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('sentiment_transform', SentimentTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'bag-of-words': 1.0,
            'votes': 0.5,
            'sentiments': 1.0,
        },
    )),
    # Use a logistic regression classifier with the best parameters from the grid search
    ('clf', LogisticRegression(C=0.1, max_iter=50)),
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(classification_report(predicted, y_test))
As we can see, with almost everything the same as the pipeline in Section 2, changing the value of C and that of max_iter improves our results (the default value of C is 1.0 and that of max_iter is 100).
This is just a simple overview of performing feature engineering in scikit-learn, and there are many different models that you can try. This two-part tutorial is meant to help you get familiar and comfortable with scikit-learn and its main modules. Please check its documentation if you need more clarification on how to do certain things; scikit-learn is one of the best-documented libraries that I know of! As a parting example, with GridSearchCV you can even compare the performance of entirely different classifiers.
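Because the classifier step of a pipeline is itself just another parameter, GridSearchCV can swap whole estimators in and out. Here is a minimal sketch of that idea, reusing the pipeline fitted in the last cell; MultinomialNB is just an arbitrary second candidate chosen for illustration (it requires non-negative features, which our counts and binarized dictionaries are).
In [ ]:
from sklearn.naive_bayes import MultinomialNB
# the 'clf' step of the pipeline is a parameter, so whole classifiers can be compared
params = dict(clf=[LogisticRegression(C=0.1, max_iter=50), MultinomialNB()])
clf_search = GridSearchCV(pipeline, param_grid=params)
clf_search.fit(X_train, y_train)
print(clf_search.best_params_)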