CSE 6040, Fall 2015: Homework #2

This assignment has the following learning goals:

Reinforce your data preprocessing skills, on basic files and in SQL, using a real dataset.
Allow you to flex your creative muscles on an open-ended data analysis problem.

In particular, you will work on the data provided for the 2015 Yelp! Dataset Challenge. Start by downloading this data and reviewing the information at that page under the section heading, Notes on the Dataset.

Note that the official challenge from Yelp! is an open competition to produce the coolest analysis of their dataset. The "open-ended" part of this homework assignment might be a first step toward helping your team win the $5,000 prize! (Entries for that competition are due December 31.)

You may work in teams of up to 2 students each. If you want a partner but can't find one, try using some combination of your basic social skills, the instructor's knowledge of the class, and Piazza to reach out to your peers.

Upload your completed version of this notebook plus all your output files to T-Square by Friday, October 16, 2015 at 5:00pm Eastern.

Actually, we will set up T-Square to accept the assignment on Friday, Oct 16, 2015 "anywhere on earth" (AOE). However, there will be no instructor Q&A after 5pm Friday, so Shang can party and Rich can make progress on clearing his email backlog.

Part 0: List your team members here

Team members:

(name goes here)
(name goes here)

Part 1: JSON to SQLite [20 points]

The learning goal in Part 1 is to reinforce the basic lessons on data conversion and SQL, by having you preprocess some "raw" data, turning it into a SQLite database, and then running some queries on it.

Hint: If you inspect the Yelp! Academic Dataset, you will see that each file is a sequence of JSON records, with one JSON record per line. So, you will most likely want to read the relevant input file line-by-line and process each line as a JSON record.



In [ ]:

    
# As you complete Part 1, place any additional imports you
# need in this code cell.

import json
import sqlite3 as db
import pandas
from IPython.display import display
import string



In [ ]:

    
# A little helper function you can use to quickly inspect tables:
def peek_table (db, name):
    """
    Given a database connection (`db`), prints both the number of
    records in the table as well as its first few entries.
    """
    count = '''SELECT COUNT (*) FROM {table}'''.format (table=name)
    display (pandas.read_sql_query (count, db))
    peek = '''SELECT * FROM {table} LIMIT 5'''.format (table=name)
    display (pandas.read_sql_query (peek, db))



In [ ]:

    
# By way of reminder, here's how you open a connection to a database
# and request a cursor for executing queries.

db_conn = db.connect ('yelp-rest.db')
db_cursor = db_conn.cursor ()

Task 1(a). [5 points] From the Yelp! Academic Dataset, create an SQLite database called, yelp-rest.db, which contains the subset of the data pertaining to restaurants.

In particular, start by creating a table called Restaurants. This table should have the following columns: the business ID (call it Id), restaurant name (Name), city (City), state (State), coordinates (two columns, called Lat and Long), and a semicolon-separated string of the restaurant's categories (Cats).

Note: This table should only contain businesses that are categorized as restaurants.

Hint: When performing large numbers of inserts into the database, it may be helpful to execute db_conn.commit() to save the results before proceeding.



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.



In [ ]:

    
# Quickly inspect your handiwork
peek_table (db_conn, "Restaurants")

Task 1(b). [5 points] Next, create a table called Reviews, which contains only reviews of restaurants.

This table should have the following columns: the restaurant's business ID (call it BizId), the reviewer's ID (RevId), the numerical rating (Stars), the date (Date), and the number of up-votes of the review itself (three columns: Useful, Funny, and Cool).

Note: This table should only contain the subset of reviews that pertain to restaurants. You may find your results from Task 1(a) helpful here!



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.



In [ ]:

    
peek_table (db_conn, "Reviews")

Task 1(c). [5 points] Next, create a table called Users, which contains all users.

This table should have the following columns: the user's ID (Id), name (Name), and number of fans (NumFans).

Note: This table should contain all users, not just the ones that reviewed restaurants!



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.



In [ ]:

    
peek_table (db_conn, "Users")

Task 1(d). [5 points] Create a table, UserEdges, that stores the connectivity graph between users.

This table should have two columns, one for the source vertex (named Source) and one for the target vertex (named Target). Treat the graph as undirected: that is, if there is a link from a user $u$ to a user $v$, then the table should contain both edges $u \rightarrow v$ and $v \rightarrow u$.



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.



In [ ]:

    
peek_table (db_conn, "UserEdges")

Part 2: Summary statistics [20 points]

Task 2(a). [2 point] Compute the average rating (measured in "stars"), taken over all reviews.



In [1]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.

Task 2(b). [5 points] For each distinct state, compute the number of reviews and the average restaurant rating (in "stars"). You may ignore businesses that have no reviews. Store these in a dataframe variable, df, with three columns: one column for the state (named State), one column for the number of reviews (NumRevs), and one column for the average rating (named AvgStars). The rows of the df should be sorted in descending order by number of reviews.



In [ ]:

    
# ... Your code to compute `df` goes here ...



In [ ]:

    
display (df)

Task 2(c). [3 points] On average, how many reviews does each user write? You may ignore users who write no reviews.

Write Python code to answer this question in the code cell below, and enter your answer below, rounded to the nearest tenth. For instance, you would enter "5.2" if your program computes "5.24384". (type your answer here)



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.

Task 2(c). [5 points] On average, how many friends does each user have? In computing the average, include users who have no friends.

Write Python code to answer this question in the code cell below, and enter your answer here, rounded to the nearest integer: (your answer)

Hint: There is at least one relatively simple way that combines left (outer) joins and the IFNULL(...) function. Although we haven't covered these in class, you should be comfortable enough that you can read about them independently and apply them.



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.

Task 2(d). [5 points] Use Seaborn or Plotly to create a histogram of ratings in the state of Nevada.



In [ ]:

    
# Your code goes here. Feel free to use additional code cells
# to break up your work into easily testable chunks.

Part 3: Rank the Restaurants [20 points]

Come up with your own scheme, inspired by either the A-priori association mining algorithm or the PageRank algorithm, to compute a global ranking of the restaurants.

Only consider restaurants with 25 reviews or more.

Explain your scheme and compare its top rated results against a baseline scheme that simply returns restaurants in descending order based on average rating.



In [ ]: