In this notebook we are going to analyze different review datasets to determine how complete they are, and hence whether they are useful for testing the implemented recommender systems.
This dataset contains 878,561 reviews (1.3 GB) of 4,333 hotels crawled from TripAdvisor.
The dataset is cleaned by doing the following:
In [1]:
import sys
sys.path.append('/Users/fpena/UCC/Thesis/projects/yelp/source/python')
from tripadvisor.fourcity import extractor
from etl import ETLUtils
def clean_reviews(reviews):
    """
    Returns a copy of the original reviews list containing only the reviews
    that are useful for recommendation purposes

    :param reviews: a list of reviews
    :return: a copy of the original reviews list containing only the reviews
    that are useful for recommendation purposes
    """
    filtered_reviews = extractor.remove_empty_user_reviews(reviews)
    filtered_reviews = extractor.remove_missing_ratings_reviews(filtered_reviews)
    print('Finished remove_missing_ratings_reviews')
    filtered_reviews = extractor.remove_users_with_low_reviews(filtered_reviews, 10)
    print('Finished remove_users_with_low_reviews')
    filtered_reviews = extractor.remove_items_with_low_reviews(filtered_reviews, 20)
    print('Finished remove_items_with_low_reviews')
    print('Number of reviews', len(filtered_reviews))
    return filtered_reviews
def pre_process_reviews():
    """
    Returns a list of preprocessed reviews, where the reviews have been
    filtered to keep only relevant data, have had fields that are not useful
    dropped, and have gained additional fields that are handy for calculations

    :return: a list of preprocessed reviews
    """
    data_folder = '/Users/fpena/UCC/Thesis/datasets/TripAdvisor/Four-City/'
    review_file_path = data_folder + 'review.txt'
    # review_file_path = data_folder + 'review-short.json'
    reviews = ETLUtils.load_json_file(review_file_path)
    select_fields = ['ratings', 'author', 'offering_id']
    reviews = ETLUtils.select_fields(select_fields, reviews)
    extractor.extract_fields(reviews)
    ETLUtils.drop_fields(['author', 'ratings'], reviews)
    reviews = clean_reviews(reviews)
    return reviews
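The extractor helpers used above belong to the thesis codebase. As an illustration only, a filter like remove_users_with_low_reviews might look roughly as follows; this is a hedged sketch of its assumed behaviour (each review being a dict with a 'user_id' key), not the actual implementation:

from collections import Counter

def remove_users_with_low_reviews_sketch(reviews, min_reviews):
    # Keep only the reviews whose author wrote at least min_reviews reviews
    # (a sketch of the assumed behaviour, not the real extractor code).
    counts = Counter(review['user_id'] for review in reviews)
    return [review for review in reviews
            if counts[review['user_id']] >= min_reviews]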
That leaves a dataset containing a total of 2954 reviews, written by 792 users about 105 items.
In [2]:
fc_reviews = pre_process_reviews()
user_ids = extractor.get_groupby_list(fc_reviews, 'user_id')
item_ids = extractor.get_groupby_list(fc_reviews, 'offering_id')
print('Number of reviews', len(fc_reviews))
print('Number of users', len(user_ids))
print('Number of items', len(item_ids))
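get_groupby_list presumably returns the distinct values of the given field; a one-line sketch of that assumed behaviour:

def get_groupby_list_sketch(reviews, field):
    # Distinct values of `field` across all reviews (an assumption about
    # what extractor.get_groupby_list returns).
    return list({review[field] for review in reviews})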
We now proceed to calculate the sparsity of the dataset.
In [3]:
# Calculate the sparsity
from tripadvisor.reviews_dataset_analyzer import ReviewsDatasetAnalyzer
rda = ReviewsDatasetAnalyzer(fc_reviews)
sparsity = rda.calculate_sparsity()
print('Sparsity', sparsity)
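calculate_sparsity is part of ReviewsDatasetAnalyzer; presumably it computes the fraction of the user-item rating matrix that is empty. A minimal sketch of that formula, assuming reviews are dicts with 'user_id' and 'offering_id' keys:

def sparsity_sketch(reviews):
    # sparsity = 1 - (number of rated (user, item) pairs) / (|users| * |items|)
    users = {review['user_id'] for review in reviews}
    items = {review['offering_id'] for review in reviews}
    rated = {(review['user_id'], review['offering_id']) for review in reviews}
    return 1.0 - len(rated) / float(len(users) * len(items))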
Finally, we count the number of items that each user has in common with every other user.
In [4]:
# We count the number of items each user has in common with every other user
common_items_count = rda.count_items_in_common()
print(common_items_count)
# We calculate the cumulative percentage of the above counts
rda.analyze_common_items_count(common_items_count, True)
Out[4]:
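count_items_in_common presumably tallies, for every pair of users, how many items both have reviewed, and analyze_common_items_count presumably converts those tallies into the cumulative percentages discussed below. A hedged sketch of both calculations, under the same dict assumptions as above:

from collections import Counter, defaultdict
from itertools import combinations

def count_items_in_common_sketch(reviews):
    # Map N -> number of user pairs that have exactly N items in common
    # (an assumption about what count_items_in_common returns).
    items_by_user = defaultdict(set)
    for review in reviews:
        items_by_user[review['user_id']].add(review['offering_id'])
    counts = Counter()
    for user_a, user_b in combinations(items_by_user, 2):
        counts[len(items_by_user[user_a] & items_by_user[user_b])] += 1
    return dict(counts)

def cumulative_percentages_sketch(common_items_count):
    # Map N -> percentage of user pairs with at least N items in common.
    total_pairs = float(sum(common_items_count.values()))
    cumulative = {}
    remaining = total_pairs
    for n in sorted(common_items_count):
        cumulative[n] = 100.0 * remaining / total_pairs
        remaining -= common_items_count[n]
    return cumulative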
As we can see above, the dataset is very sparse: 96.6% of the ratings are missing. The dataset looks very poor, especially for collaborative filtering purposes, once we calculate the cumulative percentage of the number of items each user has in common with every other user. As shown above, only 1% of the users have two or more reviewed items in common with other users. This is a very low figure if we want to use collaborative filtering, for two main reasons: with so few co-rated items, the similarities computed between users are unreliable; and for most pairs of users there is no overlap at all, so no similarity can be computed.
For comparison, we now run the same analysis on the MovieLens 100K dataset.
In [5]:
from tripadvisor.fourcity import movielens_extractor
ml_reviews = movielens_extractor.get_ml_100K_dataset()
user_ids = extractor.get_groupby_list(ml_reviews, 'user_id')
item_ids = extractor.get_groupby_list(ml_reviews, 'offering_id')
print('Number of reviews', len(ml_reviews))
print('Number of users', len(user_ids))
print('Number of items', len(item_ids))
In [6]:
rda = ReviewsDatasetAnalyzer(ml_reviews)
sparsity = rda.calculate_sparsity()
print('Sparsity', sparsity)
As we can see, the sparsity of the MovieLens dataset is lower (93%), but this doesn't tell us much about how good the dataset is for collaborative filtering purposes. To analyze the quality of the dataset we have to count the number of items each user has in common with every other user.
In [7]:
common_items_count = rda.count_items_in_common()
print(common_items_count)
rda.analyze_common_items_count(common_items_count, True)
Out[7]:
With this dataset we obtain much better results: 86% of the users have two or more reviews in common with other users, and 50% of the users have 10 or more reviews in common with other users. This is very good if we are measuring the similarities between users based on the ratings they have in common.
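For reference, a standard way to measure similarity between two users from their co-rated items is Pearson correlation; a minimal sketch (not taken from this codebase), assuming each user's ratings are given as a dict mapping item ids to numeric ratings:

from math import sqrt

def pearson_similarity_sketch(ratings_a, ratings_b):
    # Pearson correlation over the items both users rated; returns 0.0
    # when the overlap is too small or a user's co-ratings are constant.
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / float(len(common))
    mean_b = sum(ratings_b[i] for i in common) / float(len(common))
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

With so few co-rated items per user pair in the TripAdvisor data, this measure would be undefined or unreliable for almost every pair, which is exactly the problem identified above; in the MovieLens data the much larger overlaps make it far more dependable.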