Note: This notebook uses GraphLab Create 1.0.
Feature engineering is one of the most important factors in a successful machine learning project. With GraphLab Create's SFrame, we have at our disposal the tools to make this a painless and fun ride!
Be sure to check out the Introduction to Regression Analysis notebook which provides an overview of how to use GraphLab Create to train regression and classification models, make predictions, and evaluate performance.
In this notebook, we will go over how to use categorical, dictionary, and list features in regression models, and how to engineer new features from raw review text.
Along the way, our regression model reveals some fun facts about the data.*
*Of course, these relationships are correlations, not causations.
Let us start by importing GraphLab Create!
In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')
Like the Introduction to Regression Analysis notebook, we use data from the Yelp Dataset Challenge for this tutorial. The task is to predict the 'star rating' of a restaurant for a given user.
The dataset describes 11,537 businesses, 8,282 checkin sets, 43,873 users, and 229,907 reviews; in this notebook we work with three tables: businesses, users, and reviews. Details about each of the columns in the dataset are available on the Yelp website. Please see the Introduction to Regression Analysis notebook for more details about the data preparation phase.
In [2]:
business = gl.SFrame('https://static.turi.com/datasets/regression/business.csv')
user = gl.SFrame('https://static.turi.com/datasets/regression/user.csv')
review = gl.SFrame('https://static.turi.com/datasets/regression/review.csv')
In [3]:
review_business_table = review.join(business, how='inner', on='business_id')
review_business_table = review_business_table.rename({'stars.1': 'business_avg_stars',
'type.1': 'business_type',
'review_count': 'business_review_count'})
user_business_review_table = review_business_table.join(user, how='inner', on='user_id')
user_business_review_table = user_business_review_table.rename({'name.1': 'user_name',
'type.1': 'user_type',
'average_stars': 'user_avg_stars',
'review_count': 'user_review_count'})
Let us split the prepared data into training and testing sets.
In [4]:
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)
GraphLab Create can handle features of the following types: numeric, categorical (string), dictionary, and list.
GraphLab Create's SFrame data structure is not currently optimized for holding more than a few hundred columns, but the list and dictionary types allow observations to have a much larger number of features in a single SFrame column. In this demo we will see examples of each of these complex feature types.
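For example, a single dictionary-typed column can stand in for what would otherwise be hundreds of sparse numeric columns. Here is a minimal sketch with made-up column names and values:

# A single dict-typed column stands in for many sparse numeric columns.
sf = gl.SFrame({'doc_id': [1, 2],
                'word_counts': [{'pizza': 3, 'great': 1},
                                {'slow': 2, 'service': 1}]})
sf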
Categorical features usually require special attention in regression models. Often, this includes a messy pre-processing step to make the categorical features interpretable in the context of the model. With GraphLab Create, we can add categorical features without any special pre-processing. Throw in your features as strings and we will do all the munging for you!
The variable city is a categorical variable in our dataset. The SArray's handy show() function visualizes a sketch summary, giving us a quick glance at some useful statistics.
In [5]:
train_set['city'].show()
We see there are about 60 unique strings for the city (the number of unique values is approximate in the GraphLab Sketch). The regression module in GraphLab Create uses simple encoding while training models using string features. Simple encoding compares each category to an arbitrary reference category (we choose the first category in the data as the reference). This article provides more details about simple encoding for regression models.
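For intuition, here is a minimal sketch of what simple encoding does, using a made-up category set (GraphLab Create performs the equivalent transformation internally, so you never need to write this yourself):

# Simple (dummy) encoding by hand, for intuition only.
cities = ['Phoenix', 'Tempe', 'Scottsdale']  # made-up category set
reference = cities[0]                        # the first category is the reference

def encode(city):
    # One 0/1 indicator per non-reference category;
    # the reference category encodes to all zeros.
    return [int(city == c) for c in cities[1:]]

[encode(c) for c in cities]  # [[0, 0], [1, 0], [0, 1]]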
In our Introduction to Regression notebook, we used the following numerical features: user_avg_stars, business_avg_stars, user_review_count, and business_review_count.
Now we can easily add the city feature to the model.
In [6]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_avg_stars','business_avg_stars',
'user_review_count', 'business_review_count',
'city'])
Notice that the number of coefficients and the number of features aren't the same. We add dummy coefficients to encode each category. The number of these dummy coefficients is equal to the total number of categories minus 1 (for the reference category). In this example, there are 60 unique cities, so 59 dummy coefficients.
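To look at the dummy coefficients themselves, we can pull them out of the model. A minimal sketch, assuming the model exposes its coefficients as an SFrame with 'name', 'index', and 'value' columns:

coefs = model['coefficients']                # one row per model term
city_coefs = coefs[coefs['name'] == 'city']  # the 59 dummy terms for city
city_coefs.sort('value', ascending=False).head(5)  # cities with the strongest positive association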
Let us see how well the model performed:
In [7]:
model.evaluate(test_set)
Out[7]:
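The evaluate call returns a dictionary of metrics. A minimal sketch of reading them out, assuming linear regression models report the keys 'rmse' and 'max_error':

results = model.evaluate(test_set)
results['rmse'], results['max_error']  # average and worst-case error, in stars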
On average, our predicted rating was about 1 star away from the true rating but there were some corner cases which were off by almost 4 stars. Let's inspect the model to get more insight:
In [8]:
model.summary()
The coefficients for the categorical variables indicate the strength of the association between a category and the rating. In this example, we learn that restaurants in Wittman, AZ and Florence, SC are likely to have worse ratings than those in Goodyear, AZ and Grand Junction, CO, holding the other features constant.
Wikipedia claims Grand Junction, CO was number six in Outdoor Life's 2012 list of the 35 Best Hunting and Fishing Towns in the US. I am definitely planning my next vacation to Grand Junction.
GraphLab Create is built to scale. It can handle categorical variables with millions of categories, which we illustrate now by adding the user_id and business_id variables to our model.
Notice that we omit city as a feature, because the business ID uniquely determines the city. Removing this redundant information not only improves our ability to interpret the model, it also avoids perfect collinearity between city and business_id, which would make the coefficients unidentifiable.
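We can check the redundancy directly: every business_id should map to exactly one city. A quick sanity check, assuming gl.aggregate.COUNT_DISTINCT is available in this version:

# If each business_id maps to exactly one city, the max distinct-city
# count per business is 1, so 'city' adds no new information.
cities_per_business = business.groupby('business_id',
    {'n_cities': gl.aggregate.COUNT_DISTINCT('city')})
cities_per_business['n_cities'].max()  # expect 1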
In [9]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','business_avg_stars'],
max_iterations=10)
We didn't have any trouble training a model with over 50K terms. In fact, GraphLab Create can work with millions of features and billions of examples.
Note, however, that the Solver status is now TERMINATED: Iteration limit reached. In the previous model, the solver reached the best possible solution in only one pass over the data, but in this case it reached the default stopping point of 10 iterations without finding the optimal solution. We can increase the iteration limit with the max_iterations option. With 100 iterations, the results do look slightly different.
In [10]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','business_avg_stars'],
max_iterations=100)
GraphLab Create can also handle dictionary features without any manual pre-processing. The Yelp reviews have information on the number of funny, useful, or cool votes received by each review. These tallies are stored in the dataset in the form of dictionaries, one per review.
In [11]:
train_set['votes'].head(3)
Out[11]:
We can use these dictionaries to create a linear regression model without manual data munging. Each key in a dictionary feature is treated as a separate term in the model.
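Conceptually, a dictionary value such as {'funny': 0, 'useful': 5, 'cool': 1} contributes one term per key to the linear predictor. A small sketch with made-up coefficient values shows the arithmetic:

# Made-up coefficient values for illustration; the real ones are learned.
coefficients = {'funny': -0.02, 'useful': 0.01, 'cool': 0.03}
votes = {'funny': 0, 'useful': 5, 'cool': 1}  # one review's votes dictionary

# Each key contributes coefficient * value to the prediction.
sum(coefficients[k] * v for k, v in votes.items())  # 0.08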
In [12]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','votes', 'business_avg_stars'])
The fitted coefficients tell us how funny, useful, and cool votes are each associated with the star rating, all else being equal.
GraphLab Create can also handle list features without preprocessing. As an illustration, suppose the values of the votes dictionary variable had instead been stored in a list. We can use the SArray apply function to simulate this situation.
In [13]:
train_set['votes_list'] = train_set['votes'].apply(lambda x: x.values())
train_set['votes_list'].head(3)
Out[13]:
As with the dictionary feature type, each entry of the list is treated as a separate term in the model.
In [14]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id',
'user_avg_stars','votes_list', 'business_avg_stars'])
The only difference between the model trained using votes and votes_list is in the annotation of the returned coefficients. For the dictionary feature, a coefficient is annotated with its key, such as votes[cool]; for the list feature, it is annotated with its position, such as votes_list[0].
Yelp reviews contain useful tags that describe each business. Unlike categorical variables, a business may have several tags; these are stored as lists of strings in the categories column of our dataset:
In [15]:
train_set['categories'].head(5)
Out[15]:
Tag data takes a bit of pre-processing. Let us define a function that converts each list of the form [tag_1, tag_2, ..., tag_n] to a dictionary of the form {tag_1: 1, tag_2: 1, ..., tag_n: 1} where the keys are the tags and all values are 1 (indicating that the tag was present). We then apply this to each of the tag lists in our data.
In [16]:
tags_to_dict = lambda tags: dict(zip(tags, [1 for tag in tags]))
In [17]:
train_set['categories_dict'] = train_set.apply(lambda row: tags_to_dict(row['categories']))
train_set['categories_dict'].head(5)
Out[17]:
Now we create a linear regression model with the categories_dict included in the feature list:
In [18]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id','business_id', 'categories_dict',
'user_avg_stars','votes', 'business_avg_stars'])
The model shows the tag "Food" has a slightly positive association with a business's rating, whereas "Restaurants" is slightly negatively associated. These effects are quite small, so we will ignore them.
GraphLab Create's SArray has several very useful text processing capabilities. In this section, we apply these to the raw text of Yelp business reviews to improve our linear model's predictions.
For this, we use the count_words function from the text_analytics module, which converts a raw text string into a dictionary where the keys are the words and the values are the word counts. For example, the first review is:
In [19]:
train_set['text'].head(1)
Out[19]:
And the resulting word count dictionary:
In [20]:
train_set['negative_review_tags'] = gl.text_analytics.count_words(train_set['text'])
train_set['negative_review_tags'].head(1)
Out[20]:
We now use the dict_trim_by_keys SArray function to trim each dictionary down to a set of words we are interested in.
Our belief is that negative words such as filthy and disgusting are useful in predicting when a rating will be bad, so we construct a feature that captures these words. This belief is not justified by statistical or machine learning considerations, but this type of feature engineering often makes the difference between a good model and a great model.
In [21]:
bad_review_words = ['hate','terrible', 'awful', 'spit', 'disgusting', 'filthy', 'tasteless', 'rude',
'dirty', 'slow', 'poor', 'late', 'angry', 'flies', 'disappointed', 'disappointing', 'wait',
'waiting', 'dreadful', 'appalling', 'horrific', 'horrifying', 'horrible', 'horrendous', 'atrocious',
'abominable', 'deplorable', 'abhorrent', 'frightful', 'shocking', 'hideous', 'ghastly', 'grim',
'dire', 'unspeakable', 'gruesome']
train_set['negative_review_tags'] = train_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)
In [22]:
train_set['negative_review_tags'].head(5)
Out[22]:
In [23]:
model = gl.linear_regression.create(train_set, target='stars',
features = ['user_id', 'business_id', 'categories_dict', 'negative_review_tags',
'user_avg_stars', 'votes', 'business_avg_stars'])
As we hypothesized, the model tells us that the negative words carry important information for predicting star ratings! After building the tag dictionary and applying the same text transformations to the test dataset, we evaluate our model.
In [24]:
test_set['categories_dict'] = test_set.apply(lambda row: tags_to_dict(row['categories']))
test_set['categories_dict'].head(5)
test_set['negative_review_tags'] = gl.text_analytics.count_words(test_set['text'])
test_set['negative_review_tags'] = test_set['negative_review_tags'].dict_trim_by_keys(bad_review_words, exclude=False)
model.evaluate(test_set)
Out[24]:
Using the tools in GraphLab Create, we can construct rich features for very powerful regression and classification models, all with very few lines of code.