Yelp Phoenix Dataset Analysis

Fact Sheet

The Yelp Phoenix dataset contains:

  • 123612 reviews
  • written by 4199 users
  • covering 2205 items
  • an approximate sparsity of 0.986649234593
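
The sparsity figure can be checked directly from the three counts in the list. The sketch below assumes sparsity is defined as one minus the fraction of user-item pairs that have a review, and that each review covers a distinct user-item pair (which is presumably why the value is called approximate). The variable names are illustrative and are not taken from the project code.

```python
# Counts taken from the fact sheet above.
num_reviews = 123612
num_users = 4199
num_items = 2205

# Density: fraction of the user-item matrix that holds a review,
# assuming every review belongs to a distinct (user, item) pair.
density = float(num_reviews) / (num_users * num_items)

# Sparsity: fraction of user-item pairs without a review.
sparsity = 1 - density

print('Approximate sparsity: %.12f' % sparsity)
```

Under these assumptions the result agrees with the value reported in the fact sheet.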

Next, we analyze the number of reviews per user and per item.


In [6]:
import sys
sys.path.append('/Users/fpena/UCC/Thesis/projects/yelp/source/python')
from etl import ETLUtils

from etl.reviews_dataset_analyzer import ReviewsDatasetAnalyzer

# Load reviews
file_path = '/Users/fpena/UCC/Thesis/datasets/yelp_phoenix_academic_dataset/filtered_reviews.json'
reviews = ETLUtils.load_json_file(file_path)

rda = ReviewsDatasetAnalyzer(reviews)

Users Reviews Analysis

  • The average number of reviews per user is 29.4384377233
  • The minimum number of reviews a user has is 10
  • The maximum number of reviews a user has is 476
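
For readers without access to the project's `ReviewsDatasetAnalyzer`, the same three statistics can be sketched with the standard library alone. The helper `review_count_stats` and the toy `sample_reviews` list below are hypothetical; the only thing assumed about the real data is that each review is a dictionary with `user_id` and `offering_id` fields, as the cells in this notebook suggest.

```python
from collections import Counter

# Hypothetical toy reviews; the real dataset is loaded from JSON.
sample_reviews = [
    {'user_id': 'u1', 'offering_id': 'a'},
    {'user_id': 'u1', 'offering_id': 'b'},
    {'user_id': 'u1', 'offering_id': 'c'},
    {'user_id': 'u2', 'offering_id': 'a'},
    {'user_id': 'u3', 'offering_id': 'b'},
    {'user_id': 'u3', 'offering_id': 'c'},
]


def review_count_stats(reviews, field):
    """Return mean, min and max number of reviews grouped by `field`."""
    # Count how many reviews share each value of the field.
    counts = Counter(review[field] for review in reviews)
    values = list(counts.values())
    return {
        'mean': float(sum(values)) / len(values),
        'min': min(values),
        'max': max(values),
    }


user_stats = review_count_stats(sample_reviews, 'user_id')
item_stats = review_count_stats(sample_reviews, 'offering_id')
print(user_stats)
print(item_stats)
```

Passing `'offering_id'` instead of `'user_id'` gives the corresponding per-item statistics.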

In [7]:
# Number of reviews per user
users_summary = rda.summarize_reviews_by_field('user_id')
print('Average number of reviews per user', float(rda.num_reviews)/rda.num_users)
users_summary.plot(kind='line', rot=0)


('Average number of reviews per user', 29.438437723267445)
Out[7]:
<matplotlib.axes.AxesSubplot at 0x11990a5d0>

Items Reviews Analysis

  • The average number of reviews per item is 56.0598639456
  • The minimum number of reviews an item has is 20
  • The maximum number of reviews an item has is 527

In [8]:
# Number of reviews per item
items_summary = rda.summarize_reviews_by_field('offering_id')
print('Average number of reviews per item', float(rda.num_reviews)/rda.num_items)
items_summary.plot(kind='line', rot=0)


('Average number of reviews per item', 56.05986394557823)
Out[8]:
<matplotlib.axes.AxesSubplot at 0x114f85990>

Number of items 2 users have in common

In this section we count, for each pair of users, the number of items that both users have reviewed.
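
The internals of `count_items_in_common` are not shown in this notebook. As a rough sketch of one way to do the counting, the hypothetical function below builds the set of items each user reviewed, intersects the sets for every pair of users, and returns a dictionary that maps a number of shared items to the number of user pairs with that many shared items. That return shape is an assumption inferred from how the result is used in the cells that follow.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_common_items(reviews):
    """Map 'number of shared items' to 'number of user pairs'."""
    # Collect the set of items reviewed by each user.
    items_by_user = defaultdict(set)
    for review in reviews:
        items_by_user[review['user_id']].add(review['offering_id'])

    # Intersect the item sets of every unordered pair of users.
    histogram = Counter()
    for user_a, user_b in combinations(sorted(items_by_user), 2):
        shared = len(items_by_user[user_a] & items_by_user[user_b])
        histogram[shared] += 1
    return dict(histogram)


# Hypothetical toy data: four users, four items.
toy_reviews = [
    {'user_id': 'u1', 'offering_id': 'a'},
    {'user_id': 'u1', 'offering_id': 'b'},
    {'user_id': 'u1', 'offering_id': 'c'},
    {'user_id': 'u2', 'offering_id': 'a'},
    {'user_id': 'u3', 'offering_id': 'b'},
    {'user_id': 'u3', 'offering_id': 'c'},
    {'user_id': 'u4', 'offering_id': 'd'},
]
toy_counts = count_common_items(toy_reviews)
print(toy_counts)
```

This pairwise approach is quadratic in the number of users: with 4199 users there are 8,813,701 unordered pairs to compare.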


In [9]:
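# Number of items 2 users have in common
import matplotlib.pyplot as plt

common_item_counts = rda.count_items_in_common()
# Sort by number of common items so the line is drawn left to right
counts_sorted = sorted(common_item_counts.items())
plt.plot([k for k, _ in counts_sorted], [v for _, v in counts_sorted])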


Out[9]:
[<matplotlib.lines.Line2D at 0x118d1a050>]

In [10]:
from pylab import boxplot
# Expand the counts into a flat list: each key is repeated `value` times
my_data = [key for key, value in common_item_counts.iteritems() for _ in xrange(value)]
mean_common_items = float(sum(my_data))/len(my_data)
print('Average number of common items between two users:', mean_common_items)
boxplot(my_data)


('Average number of common items between two users:', 0.6242510382414833)
Out[10]:
[Box plot of the number of items two users have in common]
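
The average computed in cell In [10] can also be obtained without expanding the counts into a flat list, which may matter when the number of user pairs is large. The sketch below assumes the dictionary maps a number of shared items to the number of user pairs with that value; `mean_from_histogram` and the example dictionary are hypothetical.

```python
def mean_from_histogram(histogram):
    """Weighted mean of the keys, using the values as weights."""
    total_pairs = sum(histogram.values())
    total_shared = sum(key * value for key, value in histogram.items())
    return float(total_shared) / total_pairs


# Hypothetical histogram: 4 pairs share 0 items, 1 pair shares 1, 1 pair shares 2.
example_counts = {0: 4, 1: 1, 2: 1}
example_mean = mean_from_histogram(example_counts)
print('Average number of common items between two users:', example_mean)
```

The flat list is still needed for the box plot, since `boxplot` expects the individual observations.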