Yelp Phoenix Dataset Analysis

Fact Sheet

The Yelp Phoenix dataset contains:

  • 335022 reviews
  • Written by 70817 users
  • Covering 15579 items
  • An approximate sparsity of 0.999696333961
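
The sparsity figure above can be reproduced from the three counts. This is a minimal sketch, assuming that sparsity is defined as one minus the ratio of observed reviews to all possible user-item pairs, and that each review covers a distinct user-item pair (which is presumably why the value is only approximate):

```python
# Counts taken from the fact sheet above
num_reviews = 335022
num_users = 70817
num_items = 15579

# Fraction of the user-item matrix that is filled,
# assuming at most one review per (user, item) pair
density = float(num_reviews) / (num_users * num_items)
sparsity = 1 - density
print('Approximate sparsity:', sparsity)
```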

Next, we analyze the number of reviews per user and per item.


In [ ]:
import sys
sys.path.append('/Users/fpena/UCC/Thesis/projects/yelp/source/python')
from etl import ETLUtils

from etl.reviews_dataset_analyzer import ReviewsDatasetAnalyzer

# Load reviews
file_path = '/Users/fpena/UCC/Thesis/datasets/yelp_phoenix_academic_dataset/filtered_reviews.json'
reviews = ETLUtils.load_json_file(file_path)

rda = ReviewsDatasetAnalyzer(reviews)

Users Reviews Analysis

  • The average number of reviews per user is 4.73081322281
  • The minimum number of reviews a user has is 1
  • The maximum number of reviews a user has is 774

In [ ]:
# Number of reviews per user
users_summary = rda.summarize_reviews_by_field('user_id')
print('Average number of reviews per user', float(rda.num_reviews)/rda.num_users)
users_summary.plot(kind='line', rot=0)
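
The internals of `summarize_reviews_by_field` are not shown in this notebook. As a rough illustration of the kind of per-field counting involved, the following sketch uses `collections.Counter` on a small made-up list of review records. The `count_reviews_by_field` helper and the sample data are hypothetical and are not part of `ReviewsDatasetAnalyzer`.

```python
from collections import Counter

def count_reviews_by_field(reviews, field):
    """Hypothetical helper: number of reviews for each distinct value of `field`."""
    return Counter(review[field] for review in reviews)

# Toy records using the same field names as the notebook
sample_reviews = [
    {'user_id': 'u1', 'offering_id': 'i1'},
    {'user_id': 'u1', 'offering_id': 'i2'},
    {'user_id': 'u2', 'offering_id': 'i1'},
]

reviews_per_user = count_reviews_by_field(sample_reviews, 'user_id')
counts = list(reviews_per_user.values())
average_reviews = float(sum(counts)) / len(counts)
print('Average:', average_reviews)
print('Minimum:', min(counts))
print('Maximum:', max(counts))
```

Passing `'offering_id'` instead of `'user_id'` would give the per-item counts in the same way.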

Items Reviews Analysis

  • The average number of reviews per item is 21.5047178895
  • The minimum number of reviews an item has is 1
  • The maximum number of reviews an item has is 1065

In [ ]:
# Number of reviews per item
items_summary = rda.summarize_reviews_by_field('offering_id')
print('Average number of reviews per item', float(rda.num_reviews)/rda.num_items)
items_summary.plot(kind='line', rot=0)

Number of items two users have in common

In this section we count the number of items that each pair of users has in common.


In [ ]:
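# Number of items 2 users have in common
import matplotlib.pyplot as plt  # needed for plt.plot below

common_item_counts = rda.count_items_in_common()
# Wrap in list() so the dict views also plot correctly under Python 3
plt.plot(list(common_item_counts.keys()), list(common_item_counts.values()))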
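
How `count_items_in_common` works is not shown here. One way such a histogram could be built, assuming it maps a number of shared items to the number of user pairs sharing exactly that many, is to compare the item sets of every pair of users. The function name and toy data below are hypothetical, and the pairwise loop is quadratic in the number of users, so it only suits small samples.

```python
from collections import Counter, defaultdict
from itertools import combinations

def count_common_items(reviews):
    """Hypothetical sketch: map k -> number of user pairs sharing exactly k items."""
    # Collect the set of items reviewed by each user
    items_by_user = defaultdict(set)
    for review in reviews:
        items_by_user[review['user_id']].add(review['offering_id'])

    # Compare every unordered pair of users once
    histogram = Counter()
    for user_a, user_b in combinations(sorted(items_by_user), 2):
        shared = len(items_by_user[user_a] & items_by_user[user_b])
        histogram[shared] += 1
    return histogram

# Toy data: u1 and u2 share two items, u3 shares none with anyone
toy_reviews = [
    {'user_id': 'u1', 'offering_id': 'i1'},
    {'user_id': 'u1', 'offering_id': 'i2'},
    {'user_id': 'u2', 'offering_id': 'i1'},
    {'user_id': 'u2', 'offering_id': 'i2'},
    {'user_id': 'u3', 'offering_id': 'i3'},
]
toy_histogram = count_common_items(toy_reviews)
print(dict(toy_histogram))
```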

In [ ]:
from pylab import boxplot
# Expand the histogram into one entry per user pair;
# items() and range() work in both Python 2 and Python 3
my_data = [key for key, value in common_item_counts.items() for _ in range(value)]
mean_common_items = float(sum(my_data))/len(my_data)
print('Average number of common items between two users:', mean_common_items)
boxplot(my_data)
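
Expanding the histogram into one list entry per user pair can use a lot of memory when there are many pairs. Assuming `common_item_counts` maps a number of common items to the number of user pairs with that count (as the cell above implies), the same mean can be computed directly as a weighted average. The `weighted_mean` helper and the small dictionary below are made up for illustration.

```python
def weighted_mean(histogram):
    """Mean of the keys of `histogram`, each weighted by its count."""
    total_pairs = sum(histogram.values())
    weighted_sum = sum(key * count for key, count in histogram.items())
    return float(weighted_sum) / total_pairs

# Made-up histogram: 6 pairs share 0 items, 3 share 1 item, 1 shares 2 items
sample_counts = {0: 6, 1: 3, 2: 1}
sample_mean = weighted_mean(sample_counts)
print('Average number of common items between two users:', sample_mean)
```

The expanded list is still needed for the box plot, so this only replaces the mean calculation.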