MovieLens 100k Dataset Analysis

Fact Sheet

The MovieLens 100k dataset contains:

  • 100000 reviews
  • 943 users
  • 1682 items
  • An approximate sparsity of 0.936953306358
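
The sparsity value in the fact sheet is consistent with the usual definition: one minus the fraction of user-item pairs that have a review. A minimal sketch, assuming that definition (the variable names are illustrative and not part of the original analysis code):

```python
# Counts taken from the fact sheet above.
num_reviews = 100000
num_users = 943
num_items = 1682

# Assumed definition: share of user-item pairs that have no review.
sparsity = 1 - float(num_reviews) / (num_users * num_items)
print('Sparsity:', sparsity)
```

In other words, only about 6.3% of the possible user-item pairs carry a review.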

Next, we analyze the number of reviews per user and per item.


In [ ]:
import sys

# Machine-specific path to the project sources; adjust it for your environment
sys.path.append('/Users/fpena/UCC/Thesis/projects/yelp/source/python')

from etl import ETLUtils
from etl.reviews_dataset_analyzer import ReviewsDatasetAnalyzer
from tripadvisor.fourcity import movielens_extractor

# Load the reviews and build the analyzer used in the cells below
reviews = movielens_extractor.get_ml_100K_dataset()
rda = ReviewsDatasetAnalyzer(reviews)

User Reviews Analysis

  • The average number of reviews per user is 106.044538706
  • The minimum number of reviews a user has is 20
  • The maximum number of reviews a user has is 737
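
The statistics above can be reproduced without the project's analyzer class. The following is a minimal sketch, assuming the reviews are a list of dictionaries keyed by 'user_id' and 'offering_id' (the field names used in this notebook); `summarize_counts` and the toy data are hypothetical and only illustrate the idea:

```python
from collections import Counter


def summarize_counts(reviews, field):
    """Return (mean, min, max) of the number of reviews per distinct value of `field`."""
    counts = Counter(review[field] for review in reviews)
    values = list(counts.values())
    return float(sum(values)) / len(values), min(values), max(values)


# Toy data for illustration only; not taken from MovieLens.
toy_reviews = [
    {'user_id': 'u1', 'offering_id': 'i1'},
    {'user_id': 'u1', 'offering_id': 'i2'},
    {'user_id': 'u1', 'offering_id': 'i3'},
    {'user_id': 'u2', 'offering_id': 'i1'},
]

mean_count, min_count, max_count = summarize_counts(toy_reviews, 'user_id')
print(mean_count, min_count, max_count)
```

Passing 'offering_id' instead of 'user_id' gives the corresponding per-item figures.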

In [ ]:
# Number of reviews per user
users_summary = rda.summarize_reviews_by_field('user_id')
print('Average number of reviews per user', float(rda.num_reviews)/rda.num_users)
users_summary.plot(kind='line', rot=0)

Item Reviews Analysis

  • The average number of reviews per item is 59.4530321046
  • The minimum number of reviews an item has is 1
  • The maximum number of reviews an item has is 583

In [ ]:
# Number of reviews per item
items_summary = rda.summarize_reviews_by_field('offering_id')
print('Average number of reviews per item', float(rda.num_reviews)/rda.num_items)
items_summary.plot(kind='line', rot=0)

Number of Items Two Users Have in Common

In this section we count, for every pair of users, the number of items that both users have reviewed.
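
One way to obtain such counts is sketched below. It is not the implementation behind `rda.count_items_in_common()`; it assumes the reviews are a list of dictionaries with 'user_id' and 'offering_id' fields, and that the result should map a number of common items to the number of user pairs sharing that many items. The function name and toy data are hypothetical:

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_common_items_sketch(reviews):
    """Map 'number of common items' to 'number of user pairs with that many'."""
    # Collect the set of items reviewed by each user.
    items_by_user = defaultdict(set)
    for review in reviews:
        items_by_user[review['user_id']].add(review['offering_id'])

    # Compare every unordered pair of users exactly once.
    histogram = Counter()
    for user_a, user_b in combinations(sorted(items_by_user), 2):
        num_common = len(items_by_user[user_a] & items_by_user[user_b])
        histogram[num_common] += 1
    return dict(histogram)


# Toy data for illustration only; not taken from MovieLens.
toy_reviews = [
    {'user_id': 'u1', 'offering_id': 'i1'},
    {'user_id': 'u1', 'offering_id': 'i2'},
    {'user_id': 'u2', 'offering_id': 'i1'},
    {'user_id': 'u2', 'offering_id': 'i2'},
    {'user_id': 'u2', 'offering_id': 'i3'},
    {'user_id': 'u3', 'offering_id': 'i4'},
]
print(count_common_items_sketch(toy_reviews))
```

The pairwise loop is quadratic in the number of users; with 943 users it visits 444153 pairs.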


In [ ]:
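# Number of items 2 users have in common
import matplotlib.pyplot as plt
common_item_counts = rda.count_items_in_common()
# Sort the keys so that the line plot runs left to right
keys = sorted(common_item_counts)
plt.plot(keys, [common_item_counts[key] for key in keys])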

In [ ]:
from pylab import boxplot
# Expand the histogram into one entry per user pair
my_data = [key for key, value in common_item_counts.items() for _ in range(value)]
mean_common_items = float(sum(my_data))/len(my_data)
print('Average number of common items between two users:', mean_common_items)
boxplot(my_data)
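
The mean can also be computed directly from the histogram as a weighted average, without expanding it into one entry per user pair. A minimal sketch, assuming the dictionary maps a number of common items to the number of user pairs with that count (as the expansion in the cell above implies); `mean_from_histogram` is a hypothetical helper:

```python
def mean_from_histogram(histogram):
    """Weighted mean of the keys, using the values as weights."""
    total_pairs = sum(histogram.values())
    weighted_sum = sum(key * count for key, count in histogram.items())
    return float(weighted_sum) / total_pairs


# Toy histogram for illustration only: two pairs share 0 items, one pair shares 2.
toy_histogram = {0: 2, 2: 1}
print(mean_from_histogram(toy_histogram))
```

The expanded list is still needed for the box plot, so this shortcut only helps when the mean is the sole quantity of interest.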