Note: This notebook uses GraphLab Create 1.0.
Feature engineering is one of the most important factors in a successful machine learning project. With GraphLab Create's SFrame, we have at our disposal the tools to make this a painless and fun ride!
Be sure to check out the Introduction to Regression Analysis notebook which provides an overview of how to use GraphLab Create to train regression and classification models, make predictions, and evaluate performance.
In this notebook, we will go over how to use categorical, dictionary, and list features in regression models, and how to engineer new features from raw review text.
Along the way, our regression model reveals some fun facts about the data.*
*Of course, these relationships are correlations, not causations.
Let us start by importing GraphLab Create!
In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')
Like the Introduction to Regression Analysis notebook, we use data from the Yelp Dataset Challenge for this tutorial. The task is to predict the 'star rating' of a restaurant for a given user.
The dataset describes 11,537 businesses, 8,282 checkin sets, 43,873 users, and 229,907 reviews; in this notebook we work with three tables: businesses, users, and reviews. Details about each of the columns in the dataset are available on the Yelp website. Please see the Introduction to Regression Analysis notebook for more details about the data preparation phase.
In [2]:
business = gl.SFrame('https://static.turi.com/datasets/regression/business.csv')
user = gl.SFrame('https://static.turi.com/datasets/regression/user.csv')
review = gl.SFrame('https://static.turi.com/datasets/regression/review.csv')
In [3]:
review_business_table = review.join(business, how='inner', on='business_id')
review_business_table = review_business_table.rename({'stars.1': 'business_avg_stars',
'type.1': 'business_type',
'review_count': 'business_review_count'})
user_business_review_table = review_business_table.join(user, how='inner', on='user_id')
user_business_review_table = user_business_review_table.rename({'name.1': 'user_name',
'type.1': 'user_type',
'average_stars': 'user_avg_stars',
'review_count': 'user_review_count'})
Let us split the prepared data into training and testing sets.
In [4]:
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)
GraphLab Create can handle features of the following types: numeric, categorical (string), dictionary, and list.
GraphLab Create's SFrame data structure is not currently optimized for holding more than a few hundred columns, but the list and dictionary types allow observations to have a much larger number of features in a single SFrame column. In this demo we will see examples of each of these complex feature types.
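For example, a single dictionary-typed column can stand in for what would otherwise be hundreds of sparse numeric columns. Here is a minimal sketch with made-up column names and values:

# A single dict-typed column stands in for many sparse numeric columns.
sf = gl.SFrame({'doc_id': [1, 2],
                'word_counts': [{'pizza': 3, 'great': 1},
                                {'slow': 2, 'service': 1}]})
sf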
Categorical features usually require special attention in regression models. Often, this includes a messy pre-processing step to make the categorical features interpretable in the context of the model. With GraphLab Create, we can add categorical features without any special pre-processing. Throw in your features as strings and we will do all the munging for you!
The variable city is a categorical variable in our dataset. The SArray's handy show() function visualizes a sketch summary, giving us a quick glance at some useful statistics.
In [5]:
train_set['city'].show()
We see there are about 60 unique strings for the city (the number of unique values is approximate in the GraphLab Sketch). The regression module in GraphLab Create uses simple encoding while training models using string features. Simple encoding compares each category to an arbitrary reference category (we choose the first category in the data as the reference). This article provides more details about simple encoding for regression models.
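For intuition, here is a minimal sketch of what simple encoding does, using a made-up category set (GraphLab Create performs the equivalent transformation internally, so you never need to write this yourself):

# Simple (dummy) encoding by hand, for intuition only.
cities = ['Phoenix', 'Tempe', 'Scottsdale']  # made-up category set
reference = cities[0]                        # the first category is the reference

def encode(city):
    # One 0/1 indicator per non-reference category;
    # the reference category encodes to all zeros.
    return [int(city == c) for c in cities[1:]]

[encode(c) for c in cities]  # [[0, 0], [1, 0], [0, 1]]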
In our Introduction to Regression notebook, we used the following numerical features: user_avg_stars, business_avg_stars, user_review_count, and business_review_count.
Now we can easily add the city feature to the model.
In [6]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_avg_stars','business_avg_stars',
'user_review_count', 'business_review_count',
'city'])
Notice that the number of coefficients and the number of features aren't the same. We add dummy coefficients to encode each category. The number of these dummy coefficients is equal to the total number of categories minus 1 (for the reference category). In this example, there are 60 unique cities, so 59 dummy coefficients.
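To look at the dummy coefficients themselves, we can pull them out of the model. A minimal sketch, assuming the model exposes its coefficients as an SFrame with 'name', 'index', and 'value' columns:

coefs = model['coefficients']                # one row per model term
city_coefs = coefs[coefs['name'] == 'city']  # the 59 dummy terms for city
city_coefs.sort('value', ascending=False).head(5)  # cities with the strongest positive association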
Let us see how well the model performed:
In [7]:
model.evaluate(test_set)
Out[7]:
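The evaluate call returns a dictionary of metrics. A minimal sketch of reading them out, assuming linear regression models report the keys 'rmse' and 'max_error':

results = model.evaluate(test_set)
results['rmse'], results['max_error']  # average and worst-case error, in stars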
On average, our predicted rating was about 1 star away from the true rating but there were some corner cases which were off by almost 4 stars. Let's inspect the model to get more insight:
In [8]:
model.summary()
The coefficients for the categorical variables indicate the strength of the association between a category and the rating. In this example, we learn that restaurants in Wittman, AZ and Florence, SC are likely to have worse ratings than those in Goodyear, AZ and Grand Junction, CO, holding the other features constant.
Wikipedia claims Grand Junction, CO was number six in Outdoor Life's 2012 list of the 35 Best Hunting and Fishing Towns in the US. I am definitely planning my next vacation to Grand Junction.
GraphLab Create is built to scale. It can handle categorical variables with millions of categories, which we illustrate now by adding the user_id and business_id variables to our model.
Notice that we omit city as a feature, because the business ID uniquely determines the city. Removing this redundant information not only improves our ability to interpret the model, it also avoids perfect collinearity between city and business_id, which would make the coefficients unidentifiable.
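We can check the redundancy directly: every business_id should map to exactly one city. A quick sanity check, assuming gl.aggregate.COUNT_DISTINCT is available in this version:

# If each business_id maps to exactly one city, the max distinct-city
# count per business is 1, so 'city' adds no new information.
cities_per_business = business.groupby('business_id',
    {'n_cities': gl.aggregate.COUNT_DISTINCT('city')})
cities_per_business['n_cities'].max()  # expect 1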
In [9]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','business_avg_stars'],
max_iterations=10)
We didn't have any trouble training a model with over 50K terms. In fact, GraphLab Create can work with millions of features and billions of examples.
Note, however, that the Solver status is now TERMINATED: Iteration limit reached. In the previous model, the solver reached the best possible solution in only one pass over the data, but in this case it reached the default stopping point of 10 iterations without finding the optimal solution. We can increase the iteration limit with the max_iterations option. With 100 iterations, the results do look slightly different.
In [10]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','business_avg_stars'],
max_iterations=100)
GraphLab Create can also handle dictionary features without any manual pre-processing. The Yelp reviews have information on the number of funny, useful, or cool votes received by each review. These tallies are stored in the dataset in the form of dictionaries, one per review.
In [11]:
train_set['votes'].head(3)
Out[11]:
We can use these dictionaries to create a linear regression model without manual data munging. Each key in a dictionary feature is treated as a separate term in the model.
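Conceptually, a dictionary value such as {'funny': 0, 'useful': 5, 'cool': 1} contributes one term per key to the linear predictor. A small sketch with made-up coefficient values shows the arithmetic:

# Made-up coefficient values for illustration; the real ones are learned.
coefficients = {'funny': -0.02, 'useful': 0.01, 'cool': 0.03}
votes = {'funny': 0, 'useful': 5, 'cool': 1}  # one review's votes dictionary

# Each key contributes coefficient * value to the prediction.
sum(coefficients[k] * v for k, v in votes.items())  # 0.08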
In [12]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','votes', 'business_avg_stars'])
The fitted coefficients tell us how funny, useful, and cool votes are each associated with the star rating, all else being equal.
GraphLab Create can also handle list features without preprocessing. As an illustration, suppose the values of the votes dictionary variable had instead been stored in a list. We can use the SArray apply function to simulate this situation.
In [13]:
train_set['votes_list'] = train_set['votes'].apply(lambda x: x.values())
train_set['votes_list'].head(3)
Out[13]:
As with the dictionary feature type, each entry of the list is treated as a separate term in the model.
In [14]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','votes_list', 'business_avg_stars'])
The only difference between the model trained using votes and votes_list is in the annotation of the returned coefficients. For the dictionary feature, a coefficient is annotated with its key, such as votes[cool]; for the list feature, it is annotated with its position, such as votes_list[0].
Yelp reviews contain useful tags that describe each business. Unlike categorical variables, a business may have several tags; these are stored as lists of strings in the categories column of our dataset:
In [15]:
train_set['categories'].head(5)
Out[15]:
Tag data takes a bit of pre-processing. Let us define a function that converts each list of the form [tag_1, tag_2, ..., tag_n] to a dictionary of the form {tag_1: 1, tag_2: 1, ..., tag_n: 1} where the keys are the tags and all values are 1 (indicating that the tag was present). We then apply this to each of the tag lists in our data.
In [16]:
tags_to_dict = lambda tags: dict(zip(tags, [1 for tag in tags]))
In [17]:
train_set['categories_dict'] = train_set.apply(lambda row: tags_to_dict(row['categories']))
train_set['categories_dict'].head(5)
Out[17]:
Now we create a linear regression model with the categories_dict included in the feature list:
In [18]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id', 'categories_dict',
'user_avg_stars','votes', 'business_avg_stars'])
The model shows the tag "Food" has a slightly positive association with a business's rating, whereas "Restaurants" is slightly negatively associated. These effects are quite small, so we will ignore them.
GraphLab Create's SArray has several very useful text processing capabilities. In this section, we apply these to the raw text of Yelp business reviews to improve our linear model's predictions.
For this, we use the count_words function from the text_analytics module, which converts a raw text string into a dictionary where the keys are the words and the values are the word counts. For example, the first review is:
In [19]:
train_set['text'].head(1)
Out[19]:
And the resulting word count dictionary:
In [20]:
train_set['negative_review_tags'] = gl.text_analytics.count_words(train_set['text'])
train_set['negative_review_tags'].head(1)
Out[20]:
We now use the dict_trim_by_keys SArray function to trim each dictionary down to a set of words we are interested in.
Our belief is that negative words such as filthy and disgusting are useful in predicting when a rating will be bad, so we construct a feature that captures these words. This belief is not justified by statistical or machine learning considerations, but this type of feature engineering often makes the difference between a good model and a great model.
In [21]:
bad_review_words = ['hate','terrible', 'awful', 'spit', 'disgusting', 'filthy', 'tasteless', 'rude',
'dirty', 'slow', 'poor', 'late', 'angry', 'flies', 'disappointed', 'disappointing', 'wait',
'waiting', 'dreadful', 'appalling', 'horrific', 'horrifying', 'horrible', 'horrendous', 'atrocious',
'abominable', 'deplorable', 'abhorrent', 'frightful', 'shocking', 'hideous', 'ghastly', 'grim',
'dire', 'unspeakable', 'gruesome']
train_set['negative_review_tags'] = train_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)
In [22]:
train_set['negative_review_tags'].head(5)
Out[22]:
In [23]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id', 'business_id', 'categories_dict', 'negative_review_tags',
'user_avg_stars', 'votes', 'business_avg_stars'])
As we hypothesized, the model tells us that the negative words carry important information for predicting star ratings! After building the tag dictionary and applying the same text transformations to the test dataset, we evaluate our model.
In [24]:
test_set['categories_dict'] = test_set.apply(lambda row: tags_to_dict(row['categories']))
test_set['categories_dict'].head(5)
test_set['negative_review_tags'] = gl.text_analytics.count_words(test_set['text'])
test_set['negative_review_tags'] = test_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)
model.evaluate(test_set)
Out[24]:
Using the tools in GraphLab Create, we can construct rich features for very powerful regression and classification models, all with very few lines of code.