Note: This notebook requires GraphLab Create 1.2 or higher.
Creating regression models is easy with GraphLab Create! The regression/classification toolkit contains several models including (but not restricted to) linear regression, logistic regression, and gradient boosted trees. All models are built to work with millions of features and billions of examples. The models differ in how they make predictions, but conform to the same API. Like all GraphLab Create toolkits, you can call create() to create a model, predict() to make predictions on the returned model, and evaluate() to measure performance of the predictions.
Be sure to check out our notebook on feature engineering which discusses advanced features including the use of categorical variables, dictionary features, and text data. All of these feature types are easy and intuitive to use with GraphLab Create.
In this notebook, we will go over how GraphLab Create can be used for basic tasks in regression analysis. Specifically, we will go over:
We will start by importing GraphLab Create!
In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')
In this notebook, we will use a subset of the data from the Yelp Dataset Challenge. The task is to predict the 'star rating' that a given user gives to a restaurant. The dataset comprises three tables that cover 11,537 businesses, 8,282 check-ins, 43,873 users, and 229,907 reviews. The entire dataset, as well as details about it, is available on the Yelp website.
The review table includes information about each review. Specifically, it contains:
The user table consists of details about each user:
The business table contains details about each business:
Let us take a closer look at the data.
In [2]:
business = gl.SFrame('https://static.turi.com/datasets/regression/business.csv')
user = gl.SFrame('https://static.turi.com/datasets/regression/user.csv')
review = gl.SFrame('https://static.turi.com/datasets/regression/review.csv')
The schema and the first few entries of the review are shown below. For the sake of brevity, we will skip the business and user tables.
In [3]:
review.show()
In this section, we will go through some basic steps to prepare the dataset for regression models.
First, we use an SFrame join operation to merge the business and review tables, using the business_id column to "match" the rows of the two tables. The output of the join is a single table with both business and review information. For clarity we rename some of the business columns to have more meaningful descriptions.
In [4]:
review_business_table = review.join(business, how='inner', on='business_id')
review_business_table = review_business_table.rename({'stars.1': 'business_avg_stars',
                                                      'type.1': 'business_type',
                                                      'review_count': 'business_review_count'})
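To make the join semantics concrete, here is a minimal pure-Python sketch of an inner join on a shared key. The toy tables and the `inner_join` helper are hypothetical and only illustrate the matching logic; the `.1` suffix for clashing column names mirrors the column names renamed above. This is not how SFrame implements joins.

```python
def inner_join(left, right, on):
    """Inner-join two lists of dicts on the key column `on` (illustrative helper)."""
    # Index the right table by key so each left row can look up its matches.
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)

    joined = []
    for lrow in left:
        for rrow in index.get(lrow[on], []):  # left rows without a match are dropped
            merged = dict(lrow)
            for col, value in rrow.items():
                if col == on:
                    continue
                # A column name that already exists gets a '.1' suffix.
                name = col + '.1' if col in merged else col
                merged[name] = value
            joined.append(merged)
    return joined


# Hypothetical toy tables, not the Yelp data.
toy_review = [{'business_id': 'b1', 'stars': 5},
              {'business_id': 'b2', 'stars': 2},
              {'business_id': 'b9', 'stars': 4}]  # no matching business
toy_business = [{'business_id': 'b1', 'stars': 4.5},
                {'business_id': 'b2', 'stars': 3.0}]

toy_joined = inner_join(toy_review, toy_business, on='business_id')
```

In this toy case the review for `b9` disappears from the result because no business row matches it, which is the defining property of an inner join.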
Next, we join the user table to the result, using the user_id column to match rows. This gives us review, business, and user information in a single table.
In [5]:
user_business_review_table = review_business_table.join(user, how='inner', on="user_id")
user_business_review_table = user_business_review_table.rename({'name.1': 'user_name',
                                                                'type.1': 'user_type',
                                                                'average_stars': 'user_avg_stars',
                                                                'review_count': 'user_review_count'})
Now we're good to go! Let's take a look at what the final dataset looks like:
In [6]:
user_business_review_table.head(5)
Out[6]:
It's now time to do some data science! First, let us split our data into training and testing sets, using SFrame's random_split function.
In [7]:
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)
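As a rough illustration of what a seeded random split does, the sketch below sends each row to the first split with probability 0.8. The `random_split` helper here is a hypothetical stand-in written in plain Python; whether SFrame assigns rows in exactly this way is an assumption.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to the first split with probability `fraction` (illustrative helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))  # stand-in for the rows of a table
train_rows, test_rows = random_split(rows, 0.8, seed=1)
```

Because the assignment is random, the first split holds roughly, not exactly, 80% of the rows. Reusing the same seed reproduces the same split.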
Let's start out with a simple model. The target is the star rating for each review and the features are:
In [8]:
model = gl.linear_regression.create(train_set, target='stars',
                                    features=['user_avg_stars', 'business_avg_stars',
                                              'user_review_count', 'business_review_count'])
Much of the summary output is self-explanatory. We will explain below what the terms 'coefficients' and 'errors' mean.
GraphLab Create easily allows you to make predictions using the created model with the predict function. The predict function returns an SArray with a prediction for each example in the test dataset.
In [9]:
predictions = model.predict(test_set)
predictions.head(5)
Out[9]:
We can also evaluate our predictions by comparing them to known ratings. The results are evaluated using two metrics: root-mean-square error (RMSE) is a global summary of the differences between predicted values and the values actually observed, while max-error measures the worst case performance of the model on a single observation. In this example, our model made predictions which were about 1 star away from the true rating (on average) but there were a few cases where we were off by almost 4 stars.
In [10]:
model.evaluate(test_set)
Out[10]:
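For reference, the two metrics can be sketched in a few lines of plain Python. The helper names and the toy ratings below are hypothetical; the formulas are the standard definitions of RMSE and maximum absolute error, not code taken from GraphLab Create.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def max_error(actual, predicted):
    """Largest absolute difference on any single observation."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical toy ratings, not model output.
actual = [5, 4, 1, 3]
predicted = [4.0, 4.0, 3.0, 3.0]

toy_rmse = rmse(actual, predicted)      # sqrt((1 + 0 + 4 + 0) / 4)
toy_max = max_error(actual, predicted)  # worst single miss
```

RMSE summarizes typical error across all observations, while max-error is driven entirely by the single worst prediction (here, the 1-star review predicted as 3 stars).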
Let's go further in analyzing how well our model performed at predicting ratings. We perform a groupby-aggregate to calculate the average predicted rating (on the test set) for each value of the actual rating (1-5). This will help us understand when the model performs well and when it does not.
In [11]:
sf = gl.SFrame()
sf['Predicted-Rating'] = predictions
sf['Actual-Rating'] = test_set['stars']
predict_count = sf.groupby('Actual-Rating', [gl.aggregate.COUNT('Actual-Rating'), gl.aggregate.AVG('Predicted-Rating')])
predict_count.topk('Actual-Rating', k=5, reverse=True)
Out[11]:
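The groupby-aggregate step can be pictured with a small pure-Python sketch that computes the count and the average predicted rating for each actual rating. The `average_by_group` helper and the toy values are hypothetical and stand in for the SFrame aggregators used above.

```python
from collections import defaultdict


def average_by_group(keys, values):
    """Return {key: (count, mean of values)} for each distinct key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
        counts[key] += 1
    return {key: (counts[key], totals[key] / counts[key]) for key in totals}


# Hypothetical toy data, not model output.
actual = [5, 5, 1, 4, 1]
predicted = [4.5, 4.0, 3.0, 3.5, 2.0]

summary = average_by_group(actual, predicted)
```

Comparing each key with its average prediction shows at a glance where predictions drift away from the true rating.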
It looks like our model does well on ratings that were between 3 and 5 but not too well on ratings 1 and 2. One reason why this could happen is that we have a lot more reviews with 4 and 5 star ratings. In fact, the number of 4 and 5 star reviews is more than twice the number of reviews with 1-3 stars.
In addition to making predictions about new data, GraphLab's regression toolkit can provide valuable insight about the relationships between the target and feature columns in your data, revealing why your model returns the predictions that it does. Let's briefly venture into some mathematical details to explain. Linear regression models the target $Y$ as a linear combination of the feature variables $X_j$, random noise $\epsilon$, and a bias term ($\alpha_0$) (also known as the intercept or global offset):
$$Y = \alpha_0 + \sum_{j} \alpha_j X_j + \epsilon$$

The coefficients ($\alpha_j$) are what the training procedure learns. Each model coefficient describes the expected change in the target variable associated with a unit change in the feature. The bias term indicates the "inherent" or "average" target value if all feature values were set to zero.
The coefficients often tell an interesting story of how much each feature matters in predicting target values. The magnitude (absolute value) of the coefficient for each feature indicates the strength of the feature's association to the target variable, holding all other features constant. The sign on the coefficient (positive or negative) gives the direction of the association.
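A small worked example may help with reading coefficients. The intercept and coefficient values below are made up for illustration and are not the values learned by the model in this notebook. The sketch evaluates the linear formula and shows that raising one feature by one unit changes the prediction by exactly that feature's coefficient.

```python
def predict_linear(intercept, coefficients, features):
    """Evaluate intercept + sum_j coefficient_j * feature_j."""
    return intercept + sum(coefficients[name] * features[name]
                           for name in coefficients)


# Hypothetical values chosen for easy arithmetic.
intercept = -1.0
coefficients = {'user_avg_stars': 0.5, 'business_avg_stars': 0.75}

features = {'user_avg_stars': 4.0, 'business_avg_stars': 3.0}
base = predict_linear(intercept, coefficients, features)  # -1 + 2.0 + 2.25

# Raise one feature by one unit and hold the other constant.
bumped_features = dict(features, business_avg_stars=4.0)
bumped = predict_linear(intercept, coefficients, bumped_features)

change = bumped - base  # equals the coefficient of business_avg_stars
```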
For a trained model, we can access the coefficients as follows. The name is the name of the feature, the index refers to a category for categorical variables, and the value is the value of the coefficient.
In [12]:
coefs = model['coefficients']
coefs
Out[12]:
Not surprisingly, high ratings are associated with (i) users who give high ratings on average, and (ii) businesses that receive high ratings on average. More interestingly, the number of reviews submitted by a user or received by a business appears to have a very weak association with ratings.
Logistic regression is a model that is popularly used for classification tasks. In logistic regression, the probability that a binary target is True is modeled as a logistic function of the features.
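In symbols, and reusing the notation from the linear regression section above (an assumption made here for consistency, not notation taken from GraphLab Create's documentation), the model for a binary target $Y$ can be written as:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(-\left(\alpha_0 + \sum_{j} \alpha_j X_j\right)\right)}$$

The quantity $\alpha_0 + \sum_{j} \alpha_j X_j$ is a linear score. The logistic function maps it to a probability between 0 and 1, and a score of 0 corresponds to a probability of 0.5.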
First, let's construct a binary target variable. In this example, we will predict if a restaurant is good or bad, with 1 and 2 star ratings indicating a bad business and 3-5 star ratings indicating a good one.
In [13]:
user_business_review_table['is_good'] = user_business_review_table['stars'] >= 3
Next, we split the data again so that both the training and test sets include the new is_good column:
In [14]:
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)
We will use the same set of features that we used for the linear regression model. Note that the API is very similar to the linear regression API.
In [15]:
model = gl.logistic_classifier.create(train_set, target="is_good",
                                      features=['user_avg_stars', 'business_avg_stars',
                                                'user_review_count', 'business_review_count'])
Logistic regression predictions can take one of three forms:
GraphLab's logistic regression model can return predictions for any of these types:
In [16]:
# Probability
predictions = model.predict(test_set)
predictions.head(5)
Out[16]:
In [17]:
predictions = model.predict(test_set, output_type = "margin")
predictions.head(5)
Out[17]:
In [18]:
predictions = model.predict(test_set, output_type = "probability")
predictions.head(5)
Out[18]:
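The relationship between margins, probabilities, and class labels can be sketched as follows. This assumes the margin is the linear score of the model and that a class is chosen by thresholding the probability at 0.5; both are common conventions, and the helper names and tie-breaking rule are illustrative rather than taken from GraphLab Create.

```python
import math


def margin_to_probability(margin):
    """Logistic function, written to avoid overflow for large |margin|."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    z = math.exp(margin)
    return z / (1.0 + z)


def probability_to_class(probability, threshold=0.5):
    """Assumed decision rule: predict 1 when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0


# Hypothetical margins, not model output.
margins = [-2.0, 0.0, 3.0]
probabilities = [margin_to_probability(m) for m in margins]
classes = [probability_to_class(p) for p in probabilities]
```

Under these assumptions, a positive margin always maps to a probability above 0.5 and therefore to the positive class, so the three output forms carry the same ordering information at different levels of detail.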
We can evaluate our predictions by comparing them to known ratings. The results are evaluated using two metrics:
In [19]:
result = model.evaluate(test_set)
print("Accuracy : %s" % result['accuracy'])
print("Confusion Matrix : \n%s" % result['confusion_matrix'])
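To show what these two quantities measure, here is a minimal pure-Python sketch that computes accuracy and a confusion matrix from toy labels. The helpers and data are hypothetical, and the confusion matrix is stored as counts keyed by (actual, predicted) pairs, which may differ from the table layout GraphLab Create returns.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return float(correct) / len(actual)


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


# Hypothetical toy labels: 1 = good, 0 = bad.
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]

toy_accuracy = accuracy(actual, predicted)
toy_confusion = confusion_matrix(actual, predicted)
```

Accuracy is a single number, whereas the confusion matrix shows which kinds of mistakes are made, for example bad businesses predicted as good versus good businesses predicted as bad.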
GraphLab Create's evaluation toolkit contains more detail on evaluation metrics for both regression and classification. You are now good to go with regression! Be sure to check out our notebook on feature engineering to learn new tricks that can help you make better classifiers and predictors!
Logistic regression can also be used for multiclass classification. Multiclass classification allows each observation to be assigned to one of many categories (for example, ratings may be 1, 2, 3, 4, or 5). In this example, we will predict the rating of the restaurant.
In [20]:
model = gl.logistic_classifier.create(train_set, target="stars",
                                      features=['user_avg_stars', 'business_avg_stars',
                                                'user_review_count', 'business_review_count'])
Statistics about the training data including the number of classes, the set of classes registered in the dataset, as well as the number of examples in each class are stored in the model.
In [21]:
print("This model has %s classes" % model['num_classes'])
print("The set of classes in the training set is %s" % model['classes'])
For multiclass classification models, the top-k predictions can be returned as one of the following types.
In the following example, we calculate the top-2 probabilities, margins, and ranks of predictions.
In [22]:
predictions = model.predict_topk(test_set, output_type = 'probability', k = 2)
predictions.head(5)
Out[22]:
In [23]:
predictions = model.predict_topk(test_set, output_type = 'margin', k = 2)
predictions.head(5)
Out[23]:
In [24]:
predictions = model.predict_topk(test_set, output_type = 'rank', k = 2)
predictions.head(5)
Out[24]:
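Conceptually, a top-k prediction keeps only the k highest-scoring classes for each example. The sketch below does this for a single example with made-up class probabilities. The `top_k` helper, the output layout, and the 0-based rank are assumptions for illustration and are not guaranteed to match the output of predict_topk.

```python
def top_k(class_probabilities, k):
    """Return the k most probable classes, best first, with a 0-based rank."""
    ranked = sorted(class_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    return [{'class': label, 'probability': prob, 'rank': rank}
            for rank, (label, prob) in enumerate(ranked[:k])]


# Hypothetical probabilities for one review over the five star ratings.
probs = {1: 0.05, 2: 0.10, 3: 0.15, 4: 0.40, 5: 0.30}

top2 = top_k(probs, k=2)
```

With these numbers the two retained classes are 4 stars and then 5 stars; the remaining three classes are discarded.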
As with binary classification, we can evaluate our predictions by comparing them to known ratings. The results are evaluated using two metrics:
In [25]:
result = model.evaluate(test_set)
print("Confusion Matrix : \n%s" % result['confusion_matrix'])
Many difficult real-world problems have imbalanced data, where at least one class is under-represented. GraphLab Create models can improve prediction quality for some imbalanced scenarios by assigning different costs to misclassification errors for different classes.
Let us see the distribution of examples for each class in the dataset.
In [26]:
review['stars'].astype(str).show()
In [27]:
model = gl.logistic_classifier.create(train_set, target="stars",
                                      features=['user_avg_stars', 'business_avg_stars',
                                                'user_review_count', 'business_review_count'],
                                      class_weights='auto')
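One common way to pick class weights automatically is to make them inversely proportional to class frequency, so that errors on rare classes cost more. The sketch below illustrates that idea on toy labels. The helper name and the particular normalization are assumptions, and the exact weights produced by `class_weights = 'auto'` may differ.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: float(total) / (num_classes * count)
            for label, count in counts.items()}


# Hypothetical imbalanced labels: mostly 5-star, few 1-star.
labels = [5] * 6 + [4] * 3 + [1] * 1

weights = inverse_frequency_weights(labels)
```

With this normalization every class contributes the same total weight to the training objective, so the single 1-star example counts as much as the six 5-star examples combined.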
In [28]:
result = model.evaluate(test_set)
print("Confusion Matrix : \n%s" % result['confusion_matrix'])