In this notebook we are going to analyze different review datasets to determine how complete they are, and hence whether they are useful for testing the implemented recommender systems.
This dataset contains 878,561 reviews (1.3 GB) of 4,333 hotels crawled from TripAdvisor.
The dataset is cleaned by doing the following:
In [1]:
import sys
sys.path.append('/Users/fpena/UCC/Thesis/projects/yelp/source/python')
from tripadvisor.fourcity import extractor
from etl import ETLUtils
def clean_reviews(reviews):
    """
    Returns a copy of the original reviews list containing only the reviews
    that are useful for recommendation purposes

    :param reviews: a list of reviews
    :return: a copy of the original reviews list containing only the reviews
    that are useful for recommendation purposes
    """
    filtered_reviews = extractor.remove_empty_user_reviews(reviews)
    filtered_reviews = extractor.remove_missing_ratings_reviews(filtered_reviews)
    print('Finished remove_missing_ratings_reviews')
    filtered_reviews = extractor.remove_users_with_low_reviews(filtered_reviews, 10)
    print('Finished remove_users_with_low_reviews')
    filtered_reviews = extractor.remove_items_with_low_reviews(filtered_reviews, 20)
    print('Finished remove_items_with_low_reviews')
    print('Number of reviews', len(filtered_reviews))
    return filtered_reviews
def pre_process_reviews():
    """
    Returns a list of preprocessed reviews, where the reviews have been
    filtered to keep only relevant data, have had fields that are not useful
    dropped, and have gained additional fields that are handy for calculations

    :return: a list of preprocessed reviews
    """
    data_folder = '/Users/fpena/UCC/Thesis/datasets/TripAdvisor/Four-City/'
    review_file_path = data_folder + 'review.txt'
    # review_file_path = data_folder + 'review-short.json'
    reviews = ETLUtils.load_json_file(review_file_path)
    select_fields = ['ratings', 'author', 'offering_id']
    reviews = ETLUtils.select_fields(select_fields, reviews)
    extractor.extract_fields(reviews)
    ETLUtils.drop_fields(['author', 'ratings'], reviews)
    reviews = clean_reviews(reviews)
    return reviews
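The extractor helpers used above belong to the thesis codebase. As an illustration only, a filter like remove_users_with_low_reviews might look roughly as follows; this is a hedged sketch of its assumed behaviour (each review being a dict with a 'user_id' key), not the actual implementation:

from collections import Counter

def remove_users_with_low_reviews_sketch(reviews, min_reviews):
    # Keep only the reviews whose author wrote at least min_reviews reviews
    # (a sketch of the assumed behaviour, not the real extractor code).
    counts = Counter(review['user_id'] for review in reviews)
    return [review for review in reviews
            if counts[review['user_id']] >= min_reviews]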
That leaves a dataset containing a total of 2954 reviews, written by 792 users about 105 items.
In [2]:
fc_reviews = pre_process_reviews()
user_ids = extractor.get_groupby_list(fc_reviews, 'user_id')
item_ids = extractor.get_groupby_list(fc_reviews, 'offering_id')
print('Number of reviews', len(fc_reviews))
print('Number of users', len(user_ids))
print('Number of items', len(item_ids))
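get_groupby_list presumably returns the distinct values of the given field; a one-line sketch of that assumed behaviour:

def get_groupby_list_sketch(reviews, field):
    # Distinct values of `field` across all reviews (an assumption about
    # what extractor.get_groupby_list returns).
    return list({review[field] for review in reviews})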
We now proceed to calculate the sparsity of the dataset.
In [3]:
# Calculate the sparsity
from tripadvisor.reviews_dataset_analyzer import ReviewsDatasetAnalyzer
rda = ReviewsDatasetAnalyzer(fc_reviews)
sparsity = rda.calculate_sparsity()
print('Sparsity', sparsity)
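calculate_sparsity is part of ReviewsDatasetAnalyzer; presumably it computes the fraction of the user-item rating matrix that is empty. A minimal sketch of that formula, assuming reviews are dicts with 'user_id' and 'offering_id' keys:

def sparsity_sketch(reviews):
    # sparsity = 1 - (number of rated (user, item) pairs) / (|users| * |items|)
    users = {review['user_id'] for review in reviews}
    items = {review['offering_id'] for review in reviews}
    rated = {(review['user_id'], review['offering_id']) for review in reviews}
    return 1.0 - len(rated) / float(len(users) * len(items))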
Finally, we count the number of items that each user has in common with every other user.
In [4]:
# We count the number of items each user has in common with every other user
common_items_count = rda.count_items_in_common()
print(common_items_count)
# We calculate the cumulative percentage of the above counts
rda.analyze_common_items_count(common_items_count, True)
Out[4]:
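count_items_in_common presumably tallies, for every pair of users, how many items both have reviewed, and analyze_common_items_count presumably converts those tallies into the cumulative percentages discussed below. A hedged sketch of both calculations, under the same dict assumptions as above:

from collections import Counter, defaultdict
from itertools import combinations

def count_items_in_common_sketch(reviews):
    # Map N -> number of user pairs that have exactly N items in common
    # (an assumption about what count_items_in_common returns).
    items_by_user = defaultdict(set)
    for review in reviews:
        items_by_user[review['user_id']].add(review['offering_id'])
    counts = Counter()
    for user_a, user_b in combinations(items_by_user, 2):
        counts[len(items_by_user[user_a] & items_by_user[user_b])] += 1
    return dict(counts)

def cumulative_percentages_sketch(common_items_count):
    # Map N -> percentage of user pairs with at least N items in common.
    total_pairs = float(sum(common_items_count.values()))
    cumulative = {}
    remaining = total_pairs
    for n in sorted(common_items_count):
        cumulative[n] = 100.0 * remaining / total_pairs
        remaining -= common_items_count[n]
    return cumulative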
As we can see above, the dataset is very sparse: 96.6% of the ratings are missing. The dataset looks very poor, especially for collaborative filtering purposes, once we calculate the cumulative percentage of the number of items each user has in common with every other user. As shown above, only 1% of the users have two or more reviewed items in common with other users. This is a very low figure if we want to use collaborative filtering, for two main reasons: with so few co-rated items, the similarities computed between users are unreliable; and for most pairs of users there is no overlap at all, so no similarity can be computed.
For comparison, we now run the same analysis on the MovieLens 100K dataset.
In [5]:
from tripadvisor.fourcity import movielens_extractor
ml_reviews = movielens_extractor.get_ml_100K_dataset()
user_ids = extractor.get_groupby_list(ml_reviews, 'user_id')
item_ids = extractor.get_groupby_list(ml_reviews, 'offering_id')
print('Number of reviews', len(ml_reviews))
print('Number of users', len(user_ids))
print('Number of items', len(item_ids))
In [6]:
rda = ReviewsDatasetAnalyzer(ml_reviews)
sparsity = rda.calculate_sparsity()
print('Sparsity', sparsity)
As we can see, the sparsity of the MovieLens dataset is lower (93%), but this doesn't tell us much about how good the dataset is for collaborative filtering purposes. To analyze the quality of the dataset we have to count the number of items each user has in common with every other user.
In [7]:
common_items_count = rda.count_items_in_common()
print(common_items_count)
rda.analyze_common_items_count(common_items_count, True)
Out[7]:
With this dataset we obtain much better results: 86% of the users have two or more reviews in common with other users, and 50% of the users have 10 or more reviews in common with other users. This is very good if we are measuring the similarities between users based on the ratings they have in common.
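For reference, a standard way to measure similarity between two users from their co-rated items is Pearson correlation; a minimal sketch (not taken from this codebase), assuming each user's ratings are given as a dict mapping item ids to numeric ratings:

from math import sqrt

def pearson_similarity_sketch(ratings_a, ratings_b):
    # Pearson correlation over the items both users rated; returns 0.0
    # when the overlap is too small or a user's co-ratings are constant.
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / float(len(common))
    mean_b = sum(ratings_b[i] for i in common) / float(len(common))
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

With so few co-rated items per user pair in the TripAdvisor data, this measure would be undefined or unreliable for almost every pair, which is exactly the problem identified above; in the MovieLens data the much larger overlaps make it far more dependable.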