Introduction to Regression & Classification

Note: This notebook requires GraphLab Create 1.2 or higher.

Creating regression models is easy with GraphLab Create! The regression/classification toolkit contains several models including (but not restricted to) linear regression, logistic regression, and gradient boosted trees. All models are built to work with millions of features and billions of examples. The models differ in how they make predictions, but conform to the same API. Like all GraphLab Create toolkits, you can call create() to create a model, predict() to make predictions on the returned model, and evaluate() to measure performance of the predictions.

Be sure to check out our notebook on feature engineering which discusses advanced features including the use of categorical variables, dictionary features, and text data. All of these feature types are easy and intuitive to use with GraphLab Create.

Overview

In this notebook, we will go over how GraphLab Create can be used for basic tasks in regression analysis. Specifically, we will go over:

Training, prediction, and evaluation of models
Interpreting the results of the model
Binary classification
Multiclass classification
Handling Imbalanced Classes

We will start by importing GraphLab Create!



In [1]:

    
import graphlab as gl
gl.canvas.set_target('ipynb')

Data Overview

We will use a subset of the data from the Yelp Dataset Challenge for this tutorial. The task is to predict the 'star rating' that a given user gives a restaurant. The dataset comprises three tables that cover 11,537 businesses, 8,282 check-ins, 43,873 users, and 229,907 reviews. The entire dataset, along with details about it, is available on the Yelp website.

Review Data

The review table includes information about each review. Specifically, it contains:

business_id: An encrypted business ID for the business being reviewed.
user_id: An encrypted user ID for the user who provided the review.
stars: A star rating (on a scale of 1-5)
text: The raw review text.
date: Date, formatted like '2012-03-14'
votes: The number of 'useful', 'funny' or 'cool' votes provided by other users for this review.

User Data

The user table consists of details about each user:

user_id: The encrypted user ID (cross referenced in the Review table)
name: First name
review_count: Total number of reviews made by the user.
average_stars: Average rating (on a scale of 1-5) made by the user.
votes: For each vote type (i.e., 'useful', 'funny', 'cool'), the total number of votes for reviews made by this user.

Business Data

The business table contains details about each business:

business_id: Encrypted business ID (cross referenced in the Review table)
name: Business name.
neighborhoods: Neighborhoods served by the business.
full_address: Address (text format)
city: City where the business is located.
state: State where the business is located.
latitude: Latitude of the business.
longitude: Longitude of the business.
stars: A star rating (rounded to half-stars) for this business.
review_count: The total number of reviews about this business.
categories: Category tags for this business.
open: Is this business still open? (True/False)

Let us take a closer look at the data.



In [2]:

    
business = gl.SFrame('https://static.turi.com/datasets/regression/business.csv')
user = gl.SFrame('https://static.turi.com/datasets/regression/user.csv')
review = gl.SFrame('https://static.turi.com/datasets/regression/review.csv')

[INFO] 1446791576 : INFO:     (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /Users/roman/miniconda3/envs/rc17-conda/lib/python2.7/site-packages/certifi/cacert.pem
1446791576 : INFO:     (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to 
This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-83612 - Server binary: /Users/roman/miniconda3/envs/rc17-conda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1446791576.log
[INFO] GraphLab Server Version: 1.6.908

PROGRESS: Downloading https://static.turi.com/datasets/regression/business.csv to /var/tmp/graphlab-roman/83612/1f7c7329-3886-46eb-9956-da43084c959b.csv
PROGRESS: Finished parsing file https://static.turi.com/datasets/regression/business.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.069744 secs.
PROGRESS: Finished parsing file https://static.turi.com/datasets/regression/business.csv
PROGRESS: Parsing completed. Parsed 11537 lines in 0.065551 secs.
PROGRESS: Downloading https://static.turi.com/datasets/regression/user.csv to /var/tmp/graphlab-roman/83612/1d0f9518-b850-481e-8a8f-cd7bf7a6b6e2.csv
PROGRESS: Finished parsing file https://static.turi.com/datasets/regression/user.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.080215 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,list,str,str,float,float,str,int,int,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
PROGRESS: Finished parsing file https://static.turi.com/datasets/regression/user.csv
PROGRESS: Parsing completed. Parsed 43873 lines in 0.088443 secs.
PROGRESS: Downloading https://static.turi.com/datasets/regression/review.csv to /var/tmp/graphlab-roman/83612/5d90f57f-3390-46b1-9451-58d73f9114e6.csv
PROGRESS: Finished parsing file https://static.turi.com/datasets/regression/review.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.804394 secs.
Inferred types from first line of file as 
column_type_hints=[float,str,int,str,str,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
PROGRESS: Read 61212 lines. Lines per second: 49268.1
PROGRESS: Finished parsing file https://static.turi.com/datasets/regression/review.csv
PROGRESS: Parsing completed. Parsed 229907 lines in 3.22541 secs.
Inferred types from first line of file as 
column_type_hints=[str,str,str,int,str,str,str,dict,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
The schema and the first few entries of the review are shown below. For the sake of brevity, we will skip the business and user tables.



In [3]:

    
review.show()

Preparing the data

In this section, we will go through some basic steps to prepare the dataset for regression models.

First, we use an SFrame join operation to merge the business and review tables, using the business_id column to "match" the rows of the two tables. The output of the join is a single table with both business and review information. For clarity we rename some of the business columns to have more meaningful descriptions.



In [4]:

    
review_business_table = review.join(business, how='inner', on='business_id')
review_business_table = review_business_table.rename({'stars.1': 'business_avg_stars', 
                              'type.1': 'business_type',
                              'review_count': 'business_review_count'})
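
To make the join semantics concrete, here is a minimal pure-Python sketch of an inner join on a key column. It is an illustration only, not how SFrame implements join; the helper inner_join and the toy rows are hypothetical. It mimics the behavior visible in the rename above, where a right-hand column whose name collides with a left-hand column gets a '.1' suffix.

```python
def inner_join(left_rows, right_rows, on):
    """Inner-join two lists of dicts on the column `on` (illustrative only)."""
    # Index the right table by key so each left row can find its matches.
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[on], []).append(row)

    joined = []
    for left in left_rows:
        for right in right_index.get(left[on], []):  # no match -> row dropped
            out = dict(left)
            for col, value in right.items():
                if col == on:
                    continue
                # Suffix colliding column names, as in 'stars.1' above.
                out[col + '.1' if col in left else col] = value
            joined.append(out)
    return joined


# Toy data (hypothetical values, not taken from the Yelp tables).
toy_review = [{'business_id': 'b1', 'stars': 5},
              {'business_id': 'b2', 'stars': 2},
              {'business_id': 'b9', 'stars': 4}]   # b9 has no business row
toy_business = [{'business_id': 'b1', 'stars': 4.0, 'city': 'Phoenix'},
                {'business_id': 'b2', 'stars': 3.5, 'city': 'Tempe'}]

toy_joined = inner_join(toy_review, toy_business, on='business_id')
```

The review for 'b9' is dropped because an inner join keeps only keys present in both tables.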

Next, join the user table to the result, using the user_id column to match rows. We now have review, business, and user information in a single table.



In [5]:

    
user_business_review_table = review_business_table.join(user, how='inner', on="user_id")
user_business_review_table = user_business_review_table.rename({'name.1': 'user_name', 
                                   'type.1': 'user_type', 
                                   'average_stars': 'user_avg_stars',
                                   'review_count': 'user_review_count'})

Now we're good to go! Let's take a look at what the final dataset looks like:



In [6]:

    
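user_business_review_table.head(5)

Out[6]:

+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
| business_id            | date       | review_id              | stars | text                                                | type   |
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...   | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ... | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ... | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ... | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...  | review |
+------------------------+------------+------------------------+-------+-----------------------------------------------------+--------+

+------------------------+------------------------------------------+------+-------+-----+---------------------------------------+------------+
| user_id                | votes                                    | year | month | day | categories                            | city       |
+------------------------+------------------------------------------+------+-------+-----+---------------------------------------+------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | {'funny': 0, 'useful': 5, 'cool': 2} ... | 2011 | 1     | 26  | [Breakfast & Brunch, Restaurants] ... | Phoenix    |
| 0a2KyEL0d3Yb1V6aivbIuQ | {'funny': 0, 'useful': 0, 'cool': 0} ... | 2011 | 7     | 27  | [Italian, Pizza, Restaurants] ...     | Phoenix    |
| 0hT2KtfLiobPvh6cDC8JQg | {'funny': 0, 'useful': 1, 'cool': 0} ... | 2012 | 6     | 14  | [Middle Eastern, Restaurants] ...     | Tempe      |
| uZetl9T0NcROGOyFfughhg | {'funny': 0, 'useful': 2, 'cool': 1} ... | 2010 | 5     | 27  | [Active Life, Dog Parks, Parks] ...   | Scottsdale |
| vYmM4KTsC8ZfQBg-j5MWkw | {'funny': 0, 'useful': 0, 'cool': 0} ... | 2012 | 1     | 5   | [Tires, Automotive]                   | Mesa       |
+------------------------+------------------------------------------+------+-------+-----+---------------------------------------+------------+

+---------------------------------------------+----------+-----------+--------------------+------+-----------------------+--------------------+
| full_address                                | latitude | longitude | name               | open | business_review_count | business_avg_stars |
+---------------------------------------------+----------+-----------+--------------------+------+-----------------------+--------------------+
| 6106 S 32nd St\nPhoenix, AZ 85042 ...       | 33.3908  | -112.013  | Morning Glory Cafe | 1    | 116                   | 4.0                |
| 4848 E Chandler Blvd\nPhoenix, AZ 85044 ... | 33.3056  | -111.979  | Spinato's Pizzeria | 1    | 102                   | 4.0                |
| 1513 E  Apache Blvd\nTempe, AZ 85281 ...    | 33.4143  | -111.913  | Haji-Baba          | 1    | 265                   | 4.5                |
| 5401 N Hayden Rd\nScottsdale, AZ 85250 ...  | 33.5229  | -111.908  | Chaparral Dog Park | 1    | 88                    | 4.5                |
| 1357 S Power Road\nMesa, AZ 85206 ...       | 33.391   | -111.684  | Discount Tire      | 1    | 5                     | 4.5                |
+---------------------------------------------+----------+-----------+--------------------+------+-----------------------+--------------------+

+-------+---------------+----------------+-----------+-------------------+-----------+-------------+------------+--------------+
| state | business_type | user_avg_stars | user_name | user_review_count | user_type | votes_funny | votes_cool | votes_useful |
+-------+---------------+----------------+-----------+-------------------+-----------+-------------+------------+--------------+
| AZ    | business      | 3.72           | Jason     | 376               | user      | 331         | 322        | 1034         |
| AZ    | business      | 5.0            | Paul      | 2                 | user      | 2           | 0          | 0            |
| AZ    | business      | 4.33           | Nicole    | 3                 | user      | 0           | 0          | 3            |
| AZ    | business      | 4.29           | lindsey   | 31                | user      | 18          | 36         | 75           |
| AZ    | business      | 3.25           | Roger     | 28                | user      | 3           | 8          | 32           |
+-------+---------------+----------------+-----------+-------------------+-----------+-------------+------------+--------------+

[5 rows x 29 columns]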

Training, Predicting, and Evaluating Models

It's now time to do some data science! First, let us split our data into training and testing sets, using SFrame's random_split function.



In [7]:

    
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)
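
Conceptually, a seeded random split sends each row to the first set with the given probability, so the 80/20 proportion holds only approximately. The sketch below shows that idea with Python's standard random module. The helper random_split_rows is hypothetical, and it is not claimed to reproduce the exact rows that SFrame's random_split selects.

```python
import random


def random_split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)          # fixed seed -> reproducible split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))               # stand-in for table rows
train_rows, test_rows = random_split_rows(rows, 0.8, seed=1)

print(len(train_rows), len(test_rows))
```

Fixing the seed matters for a tutorial like this one: rerunning the cell gives the same split, so results remain comparable across runs.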

Let's start out with a simple model. The target is the star rating for each review and the features are:

Average rating of a given business
Average rating made by a user
Number of reviews made by a user
Number of reviews that concern a business



In [8]:

    
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'])

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163976
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 5
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.061012     | 3.974238           | 3.731815             | 0.971716      | 0.953531        |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Much of the summary output is self-explanatory. We will explain below what the terms 'coefficients' and 'errors' mean.

Making Predictions

GraphLab Create easily allows you to make predictions using the created model with the predict function. The predict function returns an SArray with a prediction for each example in the test dataset.



In [9]:

    
predictions = model.predict(test_set)
predictions.head(5)









    Out[9]:





dtype: float
Rows: 5
[3.008125182034785, 4.6755074163872745, 4.607010044148817, 3.767029472049021, 4.81813909560203]

Evaluating Results

We can also evaluate our predictions by comparing them to the known ratings. Two metrics are reported: root-mean-square error (RMSE), a global summary of the differences between the predicted and the observed values, and max-error, the worst-case error of the model on a single observation. In this example, our model's predictions were typically about 1 star away from the true rating (an RMSE of roughly 0.97), but in the worst case we were off by about 4 stars.



In [10]:

    
model.evaluate(test_set)









    Out[10]:





{'max_error': 4.0190816636474285, 'rmse': 0.9710269452058631}
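
To show what these two numbers measure, here is a small pure-Python sketch of both metrics. The helper names rmse and max_error and the four ratings are hypothetical, chosen only for illustration; this is not GraphLab Create's implementation.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def max_error(actual, predicted):
    """Largest absolute difference on any single example."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical ratings and predictions, chosen only for illustration.
actual = [5, 4, 1, 3]
predicted = [4.5, 4.0, 3.0, 3.5]

print(rmse(actual, predicted))       # dominated by the one large miss
print(max_error(actual, predicted))  # the single worst prediction
```

Because the errors are squared before averaging, a single 2-star miss pulls the RMSE above 1 even though the other three predictions are within half a star.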

Let's go further in analyzing how well our model performed at predicting ratings. We perform a groupby-aggregate to calculate the average predicted rating (on the test set) for each value of the actual rating (1-5). This will help us understand when the model performs well and when it does not.



In [11]:

    
sf = gl.SFrame()
sf['Predicted-Rating'] = predictions
sf['Actual-Rating'] = test_set['stars']
predict_count = sf.groupby('Actual-Rating', [gl.aggregate.COUNT('Actual-Rating'), gl.aggregate.AVG('Predicted-Rating')])
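predict_count.topk('Actual-Rating', k=5, reverse=True)

Out[11]:

+---------------+-------+-------------------------+
| Actual-Rating | Count | Avg of Predicted-Rating |
+---------------+-------+-------------------------+
| 1             | 3280  | 2.64863703655           |
| 2             | 4003  | 3.27077094185           |
| 3             | 6455  | 3.5406504061            |
| 4             | 15150 | 3.815774915             |
| 5             | 14383 | 4.23685172744           |
+---------------+-------+-------------------------+

[5 rows x 3 columns]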

It looks like our model does well on ratings that were between 3 and 5 but not too well on ratings 1 and 2. One reason why this could happen is that we have a lot more reviews with 4 and 5 star ratings. In fact, the number of 4 and 5 star reviews is more than twice the number of reviews with 1-3 stars.
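
The imbalance claim can be checked directly from the counts in the Out[11] table. The snippet below is just that arithmetic, with the counts copied by hand.

```python
# Counts per actual rating, copied from the Out[11] table above.
counts = {1: 3280, 2: 4003, 3: 6455, 4: 15150, 5: 14383}

high = counts[4] + counts[5]              # 4- and 5-star reviews
low = counts[1] + counts[2] + counts[3]   # 1- to 3-star reviews
ratio = high / float(low)

print(high, low, round(ratio, 2))
```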

Interpreting Results

In addition to making predictions about new data, GraphLab's regression toolkit can provide valuable insight about the relationships between the target and feature columns in your data, revealing why your model returns the predictions that it does. Let's briefly venture into some mathematical details to explain. Linear regression models the target $Y$ as a linear combination of the feature variables $X_j$, random noise $\epsilon$, and a bias term ($\alpha_0$) (also known as the intercept or global offset):

$$Y = \alpha_0 + \sum_{j} \alpha_j X_j + \epsilon$$

The coefficients ($\alpha_j$) are what the training procedure learns. Each model coefficient describes the expected change in the target variable associated with a unit change in the feature. The bias term indicates the "inherent" or "average" target value if all feature values were set to zero.

The coefficients often tell an interesting story of how much each feature matters in predicting target values. The magnitude (absolute value) of the coefficient for each feature indicates the strength of the feature's association to the target variable, holding all other features constant. The sign on the coefficient (positive or negative) gives the direction of the association.

For a trained model, we can access the coefficients as follows. The name is the name of the feature, the index refers to a category for categorical variables, and the value is the value of the coefficient.



In [12]:

    
coefs = model['coefficients']
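coefs

Out[12]:

+-----------------------+-------+-------------------+
| name                  | index | value             |
+-----------------------+-------+-------------------+
| (intercept)           | None  | -2.22960549033    |
| user_avg_stars        | None  | 0.810357163135    |
| business_avg_stars    | None  | 0.781279217743    |
| user_review_count     | None  | 1.97346316063e-05 |
| business_review_count | None  | 5.31674540185e-05 |
+-----------------------+-------+-------------------+

[5 rows x 3 columns]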

Not surprisingly, high ratings are associated with (i) users who give high ratings on average, and (ii) businesses that receive high ratings on average. More interestingly, the number of reviews submitted by a user or received by a business appears to have a very weak association with ratings.
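
As a worked example of the linear model, we can reproduce a prediction by hand from the coefficients in Out[12]. The feature values below describe a hypothetical user and business, made up for illustration; only the coefficients come from the trained model.

```python
# Coefficients copied from the Out[12] table above.
coefficients = {
    '(intercept)': -2.22960549033,
    'user_avg_stars': 0.810357163135,
    'business_avg_stars': 0.781279217743,
    'user_review_count': 1.97346316063e-05,
    'business_review_count': 5.31674540185e-05,
}

# A hypothetical example: feature values are made up for illustration.
example = {
    'user_avg_stars': 4.0,
    'business_avg_stars': 4.0,
    'user_review_count': 10,
    'business_review_count': 100,
}

# Y = alpha_0 + sum_j alpha_j * X_j (the noise term is not predicted).
prediction = coefficients['(intercept)'] + sum(
    coefficients[name] * value for name, value in example.items())

print(prediction)
```

For this example the prediction is a little over 4.1 stars, and almost all of it comes from the two average-rating features. Raising user_avg_stars by one star would raise the prediction by about 0.81, whereas the two review counts together contribute less than 0.01.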

Binary Classification

Logistic regression is a model that is popularly used for classification tasks. In logistic regression, the probability that a binary target is True is modeled as a logistic function of the features.

First, let's construct a binary target variable. In this example, we will predict if a restaurant is good or bad, with 1 and 2 star ratings indicating a bad business and 3-5 star ratings indicating a good one.



In [13]:

    
user_business_review_table['is_good'] = user_business_review_table['stars'] >= 3

Next, let's create a train-test split:



In [14]:

    
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)

We will use the same set of features that we used for the linear regression model. Note that the API is very similar to the linear regression API.



In [15]:

    
model = gl.logistic_classifier.create(train_set, target="is_good", 
                                      features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'])

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163868
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 5
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.265869     | 0.863842          | 0.863730            |
PROGRESS: | 2         | 3        | 0.422368     | 0.866545          | 0.867620            |
PROGRESS: | 3         | 4        | 0.578074     | 0.866807          | 0.867963            |
PROGRESS: | 4         | 5        | 0.733239     | 0.867009          | 0.868192            |
PROGRESS: | 5         | 6        | 0.880150     | 0.867009          | 0.868192            |
PROGRESS: | 6         | 7        | 1.030508     | 0.867009          | 0.868192            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

Making Predictions (Probabilities, Classes, or Margins)

Logistic regression predictions can take one of three forms:

Classes (default) : Thresholds the probability estimate at 0.5 to predict a class label i.e. True/False.
Probabilities : A probability estimate (in the range [0,1]) that the example is in the True class.
Margins : Distance to the linear decision boundary learned by the model. The larger the distance, the more confidence we have that it belongs to one class or the other.

GraphLab's logistic regression model can return predictions for any of these types:



In [16]:

    
# Class (default)
predictions = model.predict(test_set)
predictions.head(5)









    Out[16]:





dtype: int
Rows: 5
[1, 1, 1, 1, 1]



In [17]:

    
predictions = model.predict(test_set, output_type = "margin")
predictions.head(5)









    Out[17]:





dtype: float
Rows: 5
[0.4689005631839578, 3.9376949950820297, 3.8817190122552834, 2.206682186278641, 4.370152167477832]



In [18]:

    
predictions = model.predict(test_set, output_type = "probability")
predictions.head(5)









    Out[18]:





dtype: float
Rows: 5
[0.6151235014526506, 0.9808796206839487, 0.9798010518655392, 0.9008479705745347, 0.9875086909040025]
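
The three output types are closely related. The sketch below assumes the probability is the logistic (sigmoid) function of the margin, which is how logistic regression is defined earlier in this section and is consistent with the outputs above. It takes the first two margins from Out[17], converts them to probabilities, and thresholds at 0.5 to get classes.

```python
import math


def sigmoid(margin):
    """Logistic function: maps a margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# First two margins from Out[17] above.
margins = [0.4689005631839578, 3.9376949950820297]

probabilities = [sigmoid(m) for m in margins]
classes = [int(p >= 0.5) for p in probabilities]   # threshold at 0.5

print(probabilities)  # compare with the first two values in Out[18]
print(classes)        # compare with the first two values in Out[16]
```

A margin of 0 corresponds to a probability of exactly 0.5, which is why a positive margin always yields class 1.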

Evaluating Results

We can evaluate our predictions by comparing them to known ratings. The results are evaluated using two metrics:

Classification Accuracy: Fraction of test set examples with correct class label predictions.
Confusion Matrix: Cross-tabulation of predicted and actual class labels.



In [19]:

    
result = model.evaluate(test_set)
print "Accuracy         : %s" % result['accuracy']
print "Confusion Matrix : \n%s" % result['confusion_matrix']









    



Accuracy         : 0.865036629613
Confusion Matrix : 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        0        |  2379 |
|      0       |        1        |  4904 |
|      1       |        1        | 35052 |
|      1       |        0        |  936  |
+--------------+-----------------+-------+
[4 rows x 3 columns]
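
The accuracy can be recomputed from the confusion matrix, which also reveals something that the single accuracy number hides. The snippet below copies the four counts by hand and additionally computes per-class recall (the fraction of each true class that was predicted correctly). Recall is not part of the output above and is added here only for illustration.

```python
# (target_label, predicted_label) -> count, copied from the output above.
confusion = {
    (0, 0): 2379,
    (0, 1): 4904,
    (1, 1): 35052,
    (1, 0): 936,
}

total = sum(confusion.values())
correct = sum(n for (target, predicted), n in confusion.items()
              if target == predicted)
accuracy = correct / float(total)

# Recall per class: the fraction of each true class predicted correctly.
recall = {}
for label in (0, 1):
    label_total = sum(n for (target, _), n in confusion.items()
                      if target == label)
    recall[label] = confusion[(label, label)] / float(label_total)

print(accuracy)   # matches the reported accuracy
print(recall)     # recall for class 0 is much lower than for class 1
```

Only about a third of the truly "bad" examples (class 0) are identified, even though overall accuracy is near 87%, because the "good" class dominates the data.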

GraphLab Create's evaluation toolkit contains more detail on evaluation metrics for both regression and classification. You are now good to go with regression! Be sure to check out our notebook on feature engineering to learn new tricks that can help you make better classifiers and predictors!

Multiclass Classification

Logistic regression can also be used for multiclass classification. Multiclass classification allows each observation to be assigned to one of many categories (for example: ratings may be 1, 2, 3, 4, or 5). In this example, we will predict the rating of the restaurant.



In [20]:

    
model = gl.logistic_classifier.create(train_set, target="stars", 
                                      features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'])

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 163993
PROGRESS: Number of classes           : 5
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 20
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.385790     | 0.450105          | 0.453976            |
PROGRESS: | 2         | 3        | 0.636195     | 0.476514          | 0.475566            |
PROGRESS: | 3         | 4        | 0.899109     | 0.476819          | 0.476146            |
PROGRESS: | 4         | 5        | 1.163658     | 0.476874          | 0.475914            |
PROGRESS: | 5         | 6        | 1.430248     | 0.476850          | 0.475914            |
PROGRESS: | 6         | 7        | 1.688117     | 0.476850          | 0.475914            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS:

Statistics about the training data including the number of classes, the set of classes registered in the dataset, as well as the number of examples in each class are stored in the model.



In [21]:

    
print "This model has %s classes" % model['num_classes']
print "The set of classes in the training set are %s" % model['classes']









    



This model has 5 classes
The set of classes in the training set are [1, 2, 3, 4, 5]
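
The training log above reports 20 coefficients for 5 classes and 4 features, while the binary model reported 5 coefficients for 2 classes. Both counts are consistent with a common multinomial parameterization in which one class serves as a reference and every other class gets its own intercept plus one weight per feature. That this is GraphLab Create's internal layout is an assumption inferred from the printed counts; the helper num_coefficients is hypothetical and only checks the arithmetic.

```python
def num_coefficients(num_classes, num_features):
    """Coefficients when one class is the reference (assumed layout)."""
    # Each non-reference class has one intercept plus one weight per feature.
    return (num_classes - 1) * (num_features + 1)


binary_count = num_coefficients(num_classes=2, num_features=4)
multiclass_count = num_coefficients(num_classes=5, num_features=4)

print(binary_count, multiclass_count)
```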

Top-k predictions with multiclass classification

For multiclass models, the top-k predictions can be returned as one of the following types:

Probabilities (default): A probability estimate (in the range [0,1]) that the example is in the predicted class.
Margins : A score that reflects the confidence we have that the example belongs to the predicted class. The larger the score, the greater the confidence.
Rank : The rank (from 0 to k-1) of the predicted class among the top-k predictions for the example, with 0 being the most likely class.

In the following example, we calculate the top-2 probabilities, margins, and ranks of predictions.



In [22]:

    
predictions = model.predict_topk(test_set, output_type = 'probability', k = 2)
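predictions.head(5)

Out[22]:

+----+-------+----------------+
| id | class | probability    |
+----+-------+----------------+
| 0  | 4     | 0.290599734464 |
| 0  | 3     | 0.245391286641 |
| 1  | 5     | 0.708112474661 |
| 1  | 4     | 0.244961580442 |
| 2  | 5     | 0.673094718076 |
+----+-------+----------------+

[5 rows x 3 columns]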



In [23]:

    
predictions = model.predict_topk(test_set, output_type = 'margin', k = 2)
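predictions.head(5)

Out[23]:

+----+-------+---------------+
| id | class | margin        |
+----+-------+---------------+
| 0  | 4     | 0.43957010408 |
| 0  | 3     | 0.27047729158 |
| 1  | 5     | 5.9099428457  |
| 1  | 4     | 4.84844128583 |
| 2  | 5     | 5.64011549718 |
+----+-------+---------------+

[5 rows x 3 columns]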



In [24]:

    
predictions = model.predict_topk(test_set, output_type = 'rank', k = 2)
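predictions.head(5)

Out[24]:

+----+-------+------+
| id | class | rank |
+----+-------+------+
| 0  | 4     | 0    |
| 0  | 3     | 1    |
| 1  | 5     | 0    |
| 1  | 4     | 1    |
| 2  | 5     | 0    |
+----+-------+------+

[5 rows x 3 columns]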
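
The idea behind a top-k prediction can be sketched in a few lines: sort the classes by probability, keep the first k, and number them starting from 0. The helper top_k and the probabilities below are hypothetical; this illustrates the shape of the output and is not predict_topk's implementation.

```python
def top_k(class_probabilities, k):
    """Return the k most probable classes with 0-based ranks."""
    ranked = sorted(class_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    return [{'class': label, 'probability': prob, 'rank': rank}
            for rank, (label, prob) in enumerate(ranked[:k])]


# Hypothetical class probabilities for one example (they sum to 1).
probabilities = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}

result = top_k(probabilities, k=2)
for row in result:
    print(row)
```

Each example contributes k rows to the result, which is why the ids in the tables above repeat.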

Evaluation

As with binary classification, we can evaluate our predictions by comparing them to known ratings. The results are evaluated using two metrics:

Classification Accuracy: Fraction of test set examples with correct class label predictions.
Confusion Matrix: Cross-tabulation of predicted and actual class labels.



In [25]:

    
result = model.evaluate(test_set)
print "Confusion Matrix : \n%s" % result['confusion_matrix']









    



Confusion Matrix : 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      1       |        3        |  127  |
|      4       |        2        |   18  |
|      4       |        1        |  391  |
|      2       |        1        |  778  |
|      1       |        4        |  1374 |
|      5       |        2        |   6   |
|      1       |        1        |  1604 |
|      2       |        4        |  2620 |
|      3       |        2        |   10  |
|      5       |        3        |   44  |
+--------------+-----------------+-------+
[25 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Imbalanced Datasets

Many difficult real-world problems have imbalanced data, where at least one class is under-represented. GraphLab Create models can improve prediction quality for some unbalanced scenarios by assigning different costs to misclassification errors for different classes.

Let us see the distribution of examples for each class in the dataset.



In [26]:

    
review['stars'].astype(str).show()
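
One common way to counter imbalance is to weight each class inversely to its frequency, so that mistakes on rare classes cost more. The sketch below computes such weights from the per-rating counts in the Out[11] table (test-set counts, used here only as an example). The normalization is an assumption made for illustration, and the exact values GraphLab Create uses for class_weights = 'auto' may differ.

```python
# Per-rating counts from the Out[11] table (test set), used as an example.
counts = {1: 3280, 2: 4003, 3: 6455, 4: 15150, 5: 14383}

total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weights: a class with fewer examples gets more weight.
# Scaling by total / num_classes is an assumed normalization.
weights = {label: total / float(num_classes * n)
           for label, n in counts.items()}

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

With these weights every class contributes the same total weight to the training objective, so the model can no longer reach a low loss by favoring 4 and 5 star predictions alone.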



In [27]:

    
model = gl.logistic_classifier.create(train_set, target="stars", 
                                      features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'], 
                                      class_weights = 'auto')

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 164044
PROGRESS: Number of classes           : 5
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 20
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.336499     | 0.381794          | 0.383699            |
PROGRESS: | 2         | 3        | 0.582315     | 0.419692          | 0.416978            |
PROGRESS: | 3         | 4        | 0.834512     | 0.425764          | 0.426553            |
PROGRESS: | 4         | 5        | 1.088540     | 0.426422          | 0.429472            |
PROGRESS: | 5         | 6        | 1.328803     | 0.426477          | 0.429239            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: SUCCESS: Optimal solution found.



In [28]:

    
result = model.evaluate(test_set)
print "Confusion Matrix : \n%s" % result['confusion_matrix']









    



Confusion Matrix : 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      5       |        3        |  821  |
|      1       |        3        |  263  |
|      3       |        2        |  1448 |
|      2       |        4        |  726  |
|      1       |        1        |  1971 |
|      5       |        4        |  2899 |
|      2       |        2        |  1007 |
|      2       |        3        |  676  |
|      5       |        2        |  963  |
|      1       |        2        |  616  |
+--------------+-----------------+-------+
[25 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
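
With imbalanced classes, overall accuracy can hide poor performance on the rare classes, so it can help to read the confusion matrix one class at a time. The sketch below computes per-class recall from rows shaped like the table above (target_label, predicted_label, count). The helper per_class_recall and the two-class toy counts are hypothetical. Only the head of the real matrix is printed above, so the real counts are deliberately not used here.

```python
from collections import defaultdict


def per_class_recall(rows):
    """Compute recall for each target label.

    rows: iterable of dicts with keys 'target_label',
    'predicted_label', and 'count' (long-format confusion matrix).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row['target_label']] += row['count']
        if row['target_label'] == row['predicted_label']:
            correct[row['target_label']] += row['count']
    return {label: correct[label] / float(total[label]) for label in total}


# Hypothetical two-class counts, chosen only to illustrate the arithmetic.
toy_rows = [
    {'target_label': 1, 'predicted_label': 1, 'count': 30},
    {'target_label': 1, 'predicted_label': 2, 'count': 10},
    {'target_label': 2, 'predicted_label': 1, 'count': 20},
    {'target_label': 2, 'predicted_label': 2, 'count': 180},
]
recall = per_class_recall(toy_rows)
```

Assuming that iterating over the confusion matrix SFrame yields one dictionary per row, the same helper could be applied to all 25 rows of result['confusion_matrix'] to compare recall with and without class weights.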

business_id	date	review_id	stars	text	type
9yKzy9PApeiPPOUJEtnvkg	2011-01-26	fWKvX83p0-ka4JS3dc6E5A	5	My wife took me here on my birthday for break ...	review
ZRJwVLyzEJq1VAihDhYiow	2011-07-27	IjZ33sJrzXqU-0X6U8NwyA	5	I have no idea why some people give bad reviews ...	review
6oRAC4uyJCsJl1X0WZpVSA	2012-06-14	IESLBzqUCLdSzSqm0eCSxQ	4	love the gyro plate. Rice is so good and I also ...	review
_1QQZuf4zZOyFCvXc0o6Vg	2010-05-27	G-WvGaISbqqaMHlNnByodA	5	Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...	review
6ozycU1RpktNG2-1BroVtw	2012-01-05	1uJFq2r5QfJG_6ExMRCaGw	5	General Manager Scott Petello is a good egg!!! ...	review

user_id	votes	year	month	day	categories	city
rLtl8ZkDX5vH5nAx9C3q5Q	{'funny': 0, 'useful': 5, 'cool': 2} ...	2011	1	26	[Breakfast & Brunch, Restaurants] ...	Phoenix
0a2KyEL0d3Yb1V6aivbIuQ	{'funny': 0, 'useful': 0, 'cool': 0} ...	2011	7	27	[Italian, Pizza, Restaurants] ...	Phoenix
0hT2KtfLiobPvh6cDC8JQg	{'funny': 0, 'useful': 1, 'cool': 0} ...	2012	6	14	[Middle Eastern, Restaurants] ...	Tempe
uZetl9T0NcROGOyFfughhg	{'funny': 0, 'useful': 2, 'cool': 1} ...	2010	5	27	[Active Life, Dog Parks, Parks] ...	Scottsdale
vYmM4KTsC8ZfQBg-j5MWkw	{'funny': 0, 'useful': 0, 'cool': 0} ...	2012	1	5	[Tires, Automotive]	Mesa

full_address	latitude	longitude	name	open	business_review_count	business_avg_stars
6106 S 32nd St\nPhoenix, AZ 85042 ...	33.3908	-112.013	Morning Glory Cafe	1	116	4.0
4848 E Chandler Blvd\nPhoenix, AZ 85044 ...	33.3056	-111.979	Spinato's Pizzeria	1	102	4.0
1513 E Apache Blvd\nTempe, AZ 85281 ...	33.4143	-111.913	Haji-Baba	1	265	4.5
5401 N Hayden Rd\nScottsdale, AZ 85250 ...	33.5229	-111.908	Chaparral Dog Park	1	88	4.5
1357 S Power Road\nMesa, AZ 85206 ...	33.391	-111.684	Discount Tire	1	5	4.5

state	business_type	user_avg_stars	user_name	user_review_count	user_type	votes_funny	votes_cool	votes_useful
AZ	business	3.72	Jason	376	user	331	322	1034
AZ	business	5.0	Paul	2	user	2	0	0
AZ	business	4.33	Nicole	3	user	0	0	3
AZ	business	4.29	lindsey	31	user	18	36	75
AZ	business	3.25	Roger	28	user	3	8	32

Actual-Rating	Count	Avg of Predicted-Rating
1	3280	2.64863703655
2	4003	3.27077094185
3	6455	3.5406504061
4	15150	3.815774915
5	14383	4.23685172744

name	index	value
(intercept)	None	-2.22960549033
user_avg_stars	None	0.810357163135
business_avg_stars	None	0.781279217743
user_review_count	None	1.97346316063e-05
business_review_count	None	5.31674540185e-05

id	class	probability
0	4	0.290599734464
0	3	0.245391286641
1	5	0.708112474661
1	4	0.244961580442
2	5	0.673094718076

id	class	margin
0	4	0.43957010408
0	3	0.27047729158
1	5	5.9099428457
1	4	4.84844128583
2	5	5.64011549718