In [1]:

    
import graphlab as gl



In [2]:

    
!head -n 2 ../data/yelp/yelp_training_set_review.json









    



{"votes": {"funny": 0, "useful": 5, "cool": 2}, "user_id": "rLtl8ZkDX5vH5nAx9C3q5Q", "review_id": "fWKvX83p0-ka4JS3dc6E5A", "stars": 5, "date": "2011-01-26", "text": "My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best \"toast\" I've ever had.\n\nAnyway, I can't wait to go back!", "type": "review", "business_id": "9yKzy9PApeiPPOUJEtnvkg"}
{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "0a2KyEL0d3Yb1V6aivbIuQ", "review_id": "IjZ33sJrzXqU-0X6U8NwyA", "stars": 5, "date": "2011-07-27", "text": "I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\n\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small \"Here's The Beef\" pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11\" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that's how I like my sauce!\n\nWe had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. That goes to show you that  you have to try these things yourself because all these bad reviewers have some serious issues.", "type": "review", "business_id": "ZRJwVLyzEJq1VAihDhYiow"}

SFrame -- Scalable Dataframe

Powerful unstructured data processing: read line-delimited JSON directly



In [3]:

    
reviews = gl.SFrame.read_csv('../data/yelp/yelp_training_set_review.json', header=False)
reviews









    



[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-28139 - Server binary: /Users/alicez/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443318283.log
[INFO] GraphLab Server Version: 1.6.1






    



PROGRESS: Finished parsing file /Users/alicez/Documents/training/Strata NYC 2015/data/yelp/yelp_training_set_review.json
PROGRESS: Parsing completed. Parsed 100 lines in 0.85514 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 55824 lines. Lines per second: 34072.2
PROGRESS: Finished parsing file /Users/alicez/Documents/training/Strata NYC 2015/data/yelp/yelp_training_set_review.json
PROGRESS: Parsing completed. Parsed 229907 lines in 4.56457 secs.






    Out[3]:





    
+-----------------------------------------------------+
| X1                                                  |
+-----------------------------------------------------+
| {'votes': {'funny': 0, 'useful': 5, 'cool': 2}, ... |
| {'votes': {'funny': 0, 'useful': 0, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 1, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 2, 'cool': 1}, ... |
| {'votes': {'funny': 0, 'useful': 0, 'cool': 0}, ... |
| {'votes': {'funny': 1, 'useful': 3, 'cool': 4}, ... |
| {'votes': {'funny': 4, 'useful': 7, 'cool': 7}, ... |
| {'votes': {'funny': 0, 'useful': 1, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 0, 'cool': 0}, ... |
| {'votes': {'funny': 0, 'useful': 1, 'cool': 0}, ... |
+-----------------------------------------------------+
    

[229907 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
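For readers without GraphLab Create, the same one-JSON-object-per-line layout can be parsed with the standard library. This is only a sketch: the two sample lines are made up in the shape of the Yelp records, and `parse_json_lines` is a hypothetical helper, not part of the notebook. Each line becomes one dict, much like each row of column X1 above.

```python
import json

def parse_json_lines(lines):
    """Parse an iterable of JSON-encoded lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical two-record sample in the same shape as the review file.
sample_lines = [
    '{"votes": {"funny": 0, "useful": 5, "cool": 2}, "stars": 5, "text": "Great brunch."}',
    '',
    '{"votes": {"funny": 1, "useful": 0, "cool": 0}, "stars": 2, "text": "Slow service."}',
]

records = parse_json_lines(sample_lines)
print(records[0]["votes"]["useful"])
```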



In [4]:

    
reviews[0]









    Out[4]:





{'X1': {'business_id': '9yKzy9PApeiPPOUJEtnvkg',
  'date': '2011-01-26',
  'review_id': 'fWKvX83p0-ka4JS3dc6E5A',
  'stars': 5,
  'text': 'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!',
  'type': 'review',
  'user_id': 'rLtl8ZkDX5vH5nAx9C3q5Q',
  'votes': {'cool': 2, 'funny': 0, 'useful': 5}}}

Unpack to extract structure



In [5]:

    
reviews=reviews.unpack('X1','')
reviews









    Out[5]:





    
+------------------------+------------+------------------------+-------+-------------------------------------------------------+--------+
| business_id            | date       | review_id              | stars | text                                                  | type   |
+------------------------+------------+------------------------+-------+-------------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...     | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ...   | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ...   | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...    | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful.  Full ...       | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...    | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ...   | review |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 | XtnfnYmnJYi71yIuGsXIUA | 4     | Definitely come for Happy hour! Prices are amaz ...   | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...    | review |
+------------------------+------------+------------------------+-------+-------------------------------------------------------+--------+

+------------------------+------------------------------------------+
| user_id                | votes                                    |
+------------------------+------------------------------------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | {'funny': 0, 'useful': 5, 'cool': 2} ... |
| 0a2KyEL0d3Yb1V6aivbIuQ | {'funny': 0, 'useful': 0, 'cool': 0} ... |
| 0hT2KtfLiobPvh6cDC8JQg | {'funny': 0, 'useful': 1, 'cool': 0} ... |
| uZetl9T0NcROGOyFfughhg | {'funny': 0, 'useful': 2, 'cool': 1} ... |
| vYmM4KTsC8ZfQBg-j5MWkw | {'funny': 0, 'useful': 0, 'cool': 0} ... |
| sqYN3lNgvPbPCTRsMFu27g | {'funny': 1, 'useful': 3, 'cool': 4} ... |
| wFweIWhv2fREZV_dYkz_1g | {'funny': 4, 'useful': 7, 'cool': 7} ... |
| 1ieuYcKS7zeAv_U15AB13A | {'funny': 0, 'useful': 1, 'cool': 0} ... |
| Vh_DlizgGhSqQh4qfZ2h6A | {'funny': 0, 'useful': 0, 'cool': 0} ... |
| sUNkXg8-KFtCMQDV6zRzQg | {'funny': 0, 'useful': 1, 'cool': 0} ... |
+------------------------+------------------------------------------+
    

[229907 rows x 8 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Votes are still crammed in a dictionary. Let's unpack it.



In [6]:

    
reviews = reviews.unpack('votes', '')
reviews









    Out[6]:





    
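+------------------------+------------+------------------------+-------+-------------------------------------------------------+--------+
| business_id            | date       | review_id              | stars | text                                                  | type   |
+------------------------+------------+------------------------+-------+-------------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...     | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ...   | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ...   | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...    | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful.  Full ...       | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...    | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ...   | review |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 | XtnfnYmnJYi71yIuGsXIUA | 4     | Definitely come for Happy hour! Prices are amaz ...   | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...    | review |
+------------------------+------------+------------------------+-------+-------------------------------------------------------+--------+

+------------------------+------+-------+--------+
| user_id                | cool | funny | useful |
+------------------------+------+-------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | 2    | 0     | 5      |
| 0a2KyEL0d3Yb1V6aivbIuQ | 0    | 0     | 0      |
| 0hT2KtfLiobPvh6cDC8JQg | 0    | 0     | 1      |
| uZetl9T0NcROGOyFfughhg | 1    | 0     | 2      |
| vYmM4KTsC8ZfQBg-j5MWkw | 0    | 0     | 0      |
| sqYN3lNgvPbPCTRsMFu27g | 4    | 1     | 3      |
| wFweIWhv2fREZV_dYkz_1g | 7    | 4     | 7      |
| 1ieuYcKS7zeAv_U15AB13A | 0    | 0     | 1      |
| Vh_DlizgGhSqQh4qfZ2h6A | 0    | 0     | 0      |
| sUNkXg8-KFtCMQDV6zRzQg | 0    | 0     | 1      |
+------------------------+------+-------+--------+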
    

[229907 rows x 10 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
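What the two unpack calls do can be sketched in plain Python: each key of a dict-valued column becomes a column of its own. The `unpack` helper below is a hypothetical stand-in written for this note, not GraphLab's implementation, and the single sample row is abridged from the first review. Its `prefix` argument is assumed to play the role of the empty string passed as the second argument in the cells above.

```python
def unpack(rows, column, prefix=""):
    """Replace a dict-valued column with one column per key (illustrative helper)."""
    unpacked = []
    for row in rows:
        # Keep every column except the one being unpacked.
        flat = dict((k, v) for k, v in row.items() if k != column)
        # Promote each key of the nested dict to a top-level column.
        for key, value in row[column].items():
            flat[prefix + key] = value
        unpacked.append(flat)
    return unpacked

rows = [{"X1": {"stars": 5, "votes": {"funny": 0, "useful": 5, "cool": 2}}}]
step1 = unpack(rows, "X1")       # columns: stars, votes
step2 = unpack(step1, "votes")   # columns: stars, cool, funny, useful
print(sorted(step2[0].keys()))
```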

Quick data visualization



In [7]:

    
reviews.show()









    



Canvas is accessible via web browser at the URL: http://localhost:52499/index.html
Opening Canvas in default web browser.

Convert the date strings to datetime



In [8]:

    
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')
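The format string '%Y-%m-%d' looks like a strptime-style pattern (four-digit year, month, day). As a quick check outside GraphLab, the standard library parses the same strings with the same pattern; this is an analogy only and does not show how `str_to_datetime` is implemented. The parsed values carry a midnight time part, which matches how the date column prints after the conversion.

```python
from datetime import datetime

# Date strings copied from the first two reviews.
date_strings = ["2011-01-26", "2011-07-27"]

parsed = [datetime.strptime(s, "%Y-%m-%d") for s in date_strings]
print(parsed[0])  # 2011-01-26 00:00:00
```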

Munge votes and add a new column



In [9]:

    
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']
reviews









    Out[9]:





    
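+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
| business_id            | date                | review_id              | stars | text                                                  | type   |
+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 00:00:00 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...     | review |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 00:00:00 | IjZ33sJrzXqU-0X6U8NwyA | 5     | I have no idea why some people give bad reviews ...   | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 00:00:00 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ...   | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 00:00:00 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | review |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 00:00:00 | 1uJFq2r5QfJG_6ExMRCaGw | 5     | General Manager Scott Petello is a good egg!!! ...    | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 00:00:00 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful.  Full ...       | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 00:00:00 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...    | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 00:00:00 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ...   | review |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 00:00:00 | XtnfnYmnJYi71yIuGsXIUA | 4     | Definitely come for Happy hour! Prices are amaz ...   | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 00:00:00 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...    | review |
+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+

+------------------------+------+-------+--------+-------------+
| user_id                | cool | funny | useful | total_votes |
+------------------------+------+-------+--------+-------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | 2    | 0     | 5      | 7           |
| 0a2KyEL0d3Yb1V6aivbIuQ | 0    | 0     | 0      | 0           |
| 0hT2KtfLiobPvh6cDC8JQg | 0    | 0     | 1      | 1           |
| uZetl9T0NcROGOyFfughhg | 1    | 0     | 2      | 3           |
| vYmM4KTsC8ZfQBg-j5MWkw | 0    | 0     | 0      | 0           |
| sqYN3lNgvPbPCTRsMFu27g | 4    | 1     | 3      | 8           |
| wFweIWhv2fREZV_dYkz_1g | 7    | 4     | 7      | 18          |
| 1ieuYcKS7zeAv_U15AB13A | 0    | 0     | 1      | 1           |
| Vh_DlizgGhSqQh4qfZ2h6A | 0    | 0     | 0      | 0           |
| sUNkXg8-KFtCMQDV6zRzQg | 0    | 0     | 1      | 1           |
+------------------------+------+-------+--------+-------------+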
    

[229907 rows x 11 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Filter rows to remove reviews with no votes



In [10]:

    
reviews['total_votes'] > 0









    Out[10]:





dtype: int
Rows: 229907
[1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, ... ]



In [11]:

    
reviews = reviews[reviews['total_votes'] > 0]
reviews









    Out[11]:





    
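+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
| business_id            | date                | review_id              | stars | text                                                  | type   |
+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 00:00:00 | fWKvX83p0-ka4JS3dc6E5A | 5     | My wife took me here on my birthday for break ...     | review |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 00:00:00 | IESLBzqUCLdSzSqm0eCSxQ | 4     | love the gyro plate. Rice is so good and I also ...   | review |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 00:00:00 | G-WvGaISbqqaMHlNnByodA | 5     | Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | review |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 00:00:00 | m2CKSsepBCoRYWxiRUsxAg | 4     | Quiessence is, simply put, beautiful.  Full ...       | review |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 00:00:00 | riFQ3vxNpP4rWLk_CSri2A | 5     | Drop what you're doing and drive here. After I ...    | review |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 00:00:00 | JL7GXJ9u4YMx7Rzs05NfiQ | 4     | Luckily, I didn't have to travel far to make my ...   | review |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 00:00:00 | jJAIXA46pU1swYyRCdfXtQ | 5     | Nobuo shows his unique talents with everything ...    | review |
| AsSCv0q_BWqIe3mX2JqsOQ | 2010-06-16 00:00:00 | E11jzpKz9Kw5K7fuARWfRw | 5     | The oldish man who owns the store is as sweet as ...  | review |
| e9nN4XxjdHj4qtKCOPq_vg | 2011-10-21 00:00:00 | 3rPt0LxF7rgmEUrznoH22w | 5     | Wonderful Vietnamese sandwich shoppe. Their ...       | review |
| h53YuCiIDfEFSJCQpk8v1g | 2010-01-11 00:00:00 | cGnKNX3I9rthE0-TH24-qA | 5     | They have a limited time thing going on right now ... | review |
+------------------------+---------------------+------------------------+-------+-------------------------------------------------------+--------+

+------------------------+------+-------+--------+-------------+
| user_id                | cool | funny | useful | total_votes |
+------------------------+------+-------+--------+-------------+
| rLtl8ZkDX5vH5nAx9C3q5Q | 2    | 0     | 5      | 7           |
| 0hT2KtfLiobPvh6cDC8JQg | 0    | 0     | 1      | 1           |
| uZetl9T0NcROGOyFfughhg | 1    | 0     | 2      | 3           |
| sqYN3lNgvPbPCTRsMFu27g | 4    | 1     | 3      | 8           |
| wFweIWhv2fREZV_dYkz_1g | 7    | 4     | 7      | 18          |
| 1ieuYcKS7zeAv_U15AB13A | 0    | 0     | 1      | 1           |
| sUNkXg8-KFtCMQDV6zRzQg | 0    | 0     | 1      | 1           |
| -OMlS6yWkYjVldNhC31wYg | 1    | 1     | 3      | 5           |
| C1rHp3dmepNea7XiouwB6Q | 1    | 0     | 1      | 2           |
| UPtysDF6cUDUxq2KY-6Dcg | 1    | 0     | 2      | 3           |
+------------------------+------+-------+--------+-------------+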
    

[? rows x 11 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.
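The filter in In [10] and In [11] is a boolean mask: one 0/1 value per row, followed by a selection of the rows marked 1. Here is a plain-Python sketch using the vote counts of the first three reviews; the list-of-dicts layout is an assumption made for illustration and is not how an SFrame stores data.

```python
# Vote counts of the first three reviews in the table.
rows = [
    {"funny": 0, "cool": 2, "useful": 5},
    {"funny": 0, "cool": 0, "useful": 0},
    {"funny": 0, "cool": 0, "useful": 1},
]

# Element-wise sum of the three vote columns.
for row in rows:
    row["total_votes"] = row["funny"] + row["cool"] + row["useful"]

# One 0/1 flag per row, like the int SArray printed in Out[10].
mask = [int(row["total_votes"] > 0) for row in rows]

# Keep only the rows whose flag is 1.
kept = [row for row, keep in zip(rows, mask) if keep]
print(mask)  # [1, 0, 1]
```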

Classification task

Predict which reviews will be voted "funny," based on review text.

First, the labels. A review with at least one "funny" vote is labeled funny.



In [12]:

    
reviews['funny'] = reviews['funny'] > 0



In [13]:

    
reviews = reviews[['text','funny']]
reviews









    Out[13]:





    
+-------------------------------------------------------+-------+
| text                                                  | funny |
+-------------------------------------------------------+-------+
| My wife took me here on my birthday for break ...     | 0     |
| love the gyro plate. Rice is so good and I also ...   | 0     |
| Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | 0     |
| Quiessence is, simply put, beautiful.  Full ...       | 1     |
| Drop what you're doing and drive here. After I ...    | 1     |
| Luckily, I didn't have to travel far to make my ...   | 0     |
| Nobuo shows his unique talents with everything ...    | 0     |
| The oldish man who owns the store is as sweet as ...  | 1     |
| Wonderful Vietnamese sandwich shoppe. Their ...       | 0     |
| They have a limited time thing going on right now ... | 0     |
+-------------------------------------------------------+-------+
    

[147084 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

To save time, keep only the first 10,000 reviews



In [14]:

    
reviews = reviews[:10000]

Create bag-of-words representation of text



In [15]:

    
word_delims = ["\r", "\v", "\n", "\f", "\t", " ", 
               '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=', 
               ',', '.', ';', ':', '\"', '?', '|', '\\', '/', 
               '<', '>', '(', ')', '[', ']', '{', '}']

reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)
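A rough pure-Python equivalent of the bag-of-words step, for intuition only: lowercase the text, split on any run of delimiter characters, and count the remaining tokens. The `count_words` function here is a hypothetical re-implementation, not GraphLab's code. Lowercasing is an assumption, chosen because the dictionary keys in the resulting `bow` column appear in lowercase. The apostrophe is not a delimiter, so a token such as "didn't" stays whole.

```python
import re
from collections import Counter

# Same delimiter characters as the word_delims list in the cell above.
DELIMS = list("\r\v\n\f\t ~`!@#$%^&*-_+=,.;:\"?|\\/<>()[]{}")

def count_words(text, delimiters=DELIMS, to_lower=True):
    """Split text on any run of delimiter characters and count the tokens."""
    if to_lower:
        text = text.lower()
    # Build one character class from the escaped delimiter characters.
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    tokens = [tok for tok in re.split(pattern, text) if tok]
    return dict(Counter(tokens))

bow = count_words("Rice is so good, and I didn't expect such GOOD rice!")
print(sorted(bow.items()))
```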

Create tf-idf representation of the bag of words



In [16]:

    
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])



In [17]:

    
# Each tf_idf entry is a dict with a 'docs' key; keep only the value stored under that key.
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])



In [18]:

    
reviews









    Out[18]:





    
+-------------------------------------------------------+-------+-----------------------------------------------------+---------------------------------------+
| text                                                  | funny | bow                                                 | tf_idf                                |
+-------------------------------------------------------+-------+-----------------------------------------------------+---------------------------------------+
| My wife took me here on my birthday for break ...     | 0     | {'anyway': 1, 'looks': 1, 'go': 1, 'toast': 1, ...  | {'anyway': 3.564893474332945, ...     |
| love the gyro plate. Rice is so good and I also ...   | 0     | {'and': 1, 'plate': 1, 'selection': 1, 'love': ...  | {'and': 0.08621169681906551, ...      |
| Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...   | 0     | {'and': 8, 'does': 1, 'all': 1, 'surrounded': ...   | {'and': 0.6896935745525241, ...       |
| Quiessence is, simply put, beautiful.  Full ...       | 1     | {'45': 1, 'seated': 1, 'just': 1, 'bring': 1, ...   | {'just': 1.0552654841647438, ...      |
| Drop what you're doing and drive here. After I ...    | 1     | {'cute': 1, 'condesa': 1, 'desolate': 1, 'mexic ... | {'cute': 3.575550768806933, ...       |
| Luckily, I didn't have to travel far to make my ...   | 0     | {'and': 2, 'presence': 1, "didn't": 1, 'as': 1, ... | {'and': 0.17242339363813103, ...      |
| Nobuo shows his unique talents with everything ...    | 0     | {'and': 1, 'pork': 1, 'features': 1, 'go': 1, ...   | {'and': 0.08621169681906551, ...      |
| The oldish man who owns the store is as sweet as ...  | 1     | {'and': 1, 'cookies': 2, 'sweet': 1, 'is': 1, ...   | {'and': 0.08621169681906551, ...      |
| Wonderful Vietnamese sandwich shoppe. Their ...       | 0     | {'selection': 1, 'have': 2, 'baguettes': 1, ...     | {'selection': 2.6882475738060303, ... |
| They have a limited time thing going on right now ... | 0     | {'and': 1, 'limited': 1, 'all': 1, 'on': 1, ...     | {'and': 0.08621169681906551, ...      |
+-------------------------------------------------------+-------+-----------------------------------------------------+---------------------------------------+
    

[10000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
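In the table above, the weight of 'and' scales with its count (0.0862 for one occurrence, 0.1724 for two, 0.6897 for eight). That is what a weighting of the form count × log(N / df) would produce, where N is the number of documents and df the number of documents containing the word. The sketch below assumes that common formula; whether GraphLab applies exactly this variant is not confirmed here, and `tf_idf` and the toy corpus are illustrative names and data.

```python
import math

def tf_idf(bows):
    """Weight each count by log(N / df), with df = number of documents containing the word."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for word in bow:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        dict((word, count * math.log(float(n_docs) / doc_freq[word]))
             for word, count in bow.items())
        for bow in bows
    ]

# Hypothetical toy corpus of four bag-of-words dictionaries.
toy_bows = [{"and": 1, "toast": 1}, {"and": 8, "park": 1}, {"and": 2}, {"gyro": 1}]
weights = tf_idf(toy_bows)
print(weights[0])
```

Common words that occur in most documents get a small weight per occurrence, while rare words such as 'toast' get a large one.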

Create a train-test split

Returns immediately because SFrame operations are lazily evaluated.



In [19]:

    
train_sf, test_sf = reviews.random_split(0.8)
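A fractional random split can be sketched as an independent coin flip per row, which is why the resulting sizes are only approximately 80/20 (the training log further down reports 7992 examples out of 10,000). The `random_split` function below is a hypothetical illustration under that assumption, not GraphLab's implementation; the `seed` parameter is added here to make the sketch repeatable.

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first part with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

train, test = random_split(list(range(10000)), 0.8, seed=0)
print("%d train rows, %d test rows" % (len(train), len(test)))
```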

Train classifiers on bow and tf-idf

Dictionaries are automatically interpreted as sparse features.

Not demonstrated here, but any string/categorical columns are automatically interpreted as sparse features as well.
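To make "sparse features" concrete: each distinct dictionary key can be thought of as its own feature with its own coefficient, and keys absent from a document contribute nothing. A minimal sketch of a logistic prediction over such a dictionary follows. The weights are invented for illustration, and `predict_proba` is a hypothetical helper, not the GraphLab API.

```python
import math

def predict_proba(features, weights, bias=0.0):
    """Probability of the positive class for one sparse example (dict of feature -> value)."""
    # Only keys present in the example are visited; unknown keys get weight 0.
    score = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, one per word.
weights = {"hilarious": 1.5, "lol": 2.0, "parking": -0.5}

example = {"lol": 2, "parking": 1, "the": 5}   # a tiny bag of words
p = predict_proba(example, weights)
print(round(p, 4))
```

This view is consistent with the training log that follows, which reports one more coefficient than unpacked features, as one would expect from a single extra intercept term.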



In [20]:

    
m1 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['bow'], 
                                   validation_set=None, 
                                   feature_rescaling=False)









    



PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 7992
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 30012
PROGRESS: Number of coefficients    : 30013
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | 1         | 6        | 0.000002  | 1.246551     | 0.474099          |
PROGRESS: | 2         | 9        | 5.000000  | 1.400293     | 0.539039          |
PROGRESS: | 3         | 10       | 5.000000  | 1.475651     | 0.588088          |
PROGRESS: | 4         | 11       | 5.000000  | 1.547559     | 0.522523          |
PROGRESS: | 5         | 13       | 1.000000  | 1.654165     | 0.592593          |
PROGRESS: | 6         | 14       | 1.000000  | 1.719261     | 0.564565          |
PROGRESS: | 10        | 19       | 1.000000  | 2.110180     | 0.622873          |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+



In [21]:

    
m2 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['tf_idf'], 
                                   validation_set=None, 
                                   feature_rescaling=False)









    



PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 7992
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 30012
PROGRESS: Number of coefficients    : 30013
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+
PROGRESS: | 1         | 6        | 0.000011  | 0.383228     | 0.481231          |
PROGRESS: | 2         | 9        | 5.000000  | 0.620402     | 0.678929          |
PROGRESS: | 3         | 10       | 5.000000  | 0.731355     | 0.736862          |
PROGRESS: | 4         | 15       | 0.082708  | 1.089782     | 0.750375          |
PROGRESS: | 5         | 16       | 0.082708  | 1.211477     | 0.750501          |
PROGRESS: | 6         | 17       | 0.082708  | 1.371576     | 0.760511          |
PROGRESS: | 10        | 21       | 0.082708  | 1.827941     | 0.793293          |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+

Evaluate on the test set and compare performance



In [22]:

    
m1_res = m1.evaluate(test_sf)
m1_res









    Out[22]:





{'accuracy': 0.6055776892430279, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  266  |
 |      0       |        0        |  778  |
 |      1       |        1        |  438  |
 |      1       |        0        |  526  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}



In [23]:

    
m2_res = m2.evaluate(test_sf)
m2_res









    Out[23]:





{'accuracy': 0.6170318725099602, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  351  |
 |      0       |        0        |  693  |
 |      1       |        1        |  546  |
 |      1       |        0        |  418  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
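The reported accuracies can be recomputed from the confusion matrices, and the same four counts also give precision and recall for the "funny" class. The counts below are copied from the two outputs above; the `metrics` helper is a name introduced here for illustration.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) for the positive class from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = float(tp + tn) / total
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    return accuracy, precision, recall

# target=1/predicted=1 is tp, target=0/predicted=1 is fp, and so on.
bow_metrics = metrics(tp=438, fp=266, tn=778, fn=526)     # m1, bag of words
tfidf_metrics = metrics(tp=546, fp=351, tn=693, fn=418)   # m2, tf-idf

print("bow    accuracy=%.4f precision=%.4f recall=%.4f" % bow_metrics)
print("tf-idf accuracy=%.4f precision=%.4f recall=%.4f" % tfidf_metrics)
```

On these counts, the tf-idf model finds more of the funny reviews (higher recall) at the cost of more false positives.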

Baseline accuracy (what if we classify everything as the majority class?)

Fraction of 'funny' reviews in the test set:



In [25]:

    
float(test_sf['funny'].sum())/test_sf.num_rows()









    Out[25]:





0.4800796812749004

Fraction of not-funny reviews in the test set:



In [26]:

    
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()









    Out[26]:





0.5199203187250996
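Putting the two numbers together: a classifier that always predicts the majority class (not funny) would score about 0.52 on this test set, so both models (about 0.606 and 0.617) beat that baseline, though not by a wide margin. A small sketch follows, with label counts taken from the confusion matrices (964 funny, 1044 not funny); `majority_baseline_accuracy` is a name made up for this note.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common 0/1 label."""
    positive_fraction = float(sum(labels)) / len(labels)
    return max(positive_fraction, 1.0 - positive_fraction)

# Test-set labels reconstructed from the confusion-matrix row totals.
labels = [1] * 964 + [0] * 1044

baseline = majority_baseline_accuracy(labels)
print(round(baseline, 4))
```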


