TripAdvisor Datasets

Here are four datasets that contain reviews scraped from the TripAdvisor website. All of them contain the review text, but some do not contain ratings. The details of each dataset are presented below.

Four-City Dataset

This dataset consists of 878561 reviews (1.3GB) from 4333 hotels crawled from TripAdvisor.

Context

This data set is used in a research project that aims to detect fake hotel reviews.

Number of hotels

4333

Number of reviews

878561

Format

JSON

Structure

  • root
    • ratings
      • overall
      • cleanliness
      • location
      • rooms
      • service
      • sleep_quality
      • value
    • title
    • text
    • author
      • username
      • num_cities
      • num_helpful_votes
      • num_reviews
      • num_type_reviews
      • id
      • location
    • date_stayed
    • offering_id
    • num_helpful_votes
    • date
    • id
    • via_mobile

Snippet

{"ratings": {"service": 5.0, "cleanliness": 5.0, "overall": 5.0, "value": 5.0, "location": 5.0, "sleep_quality": 5.0, "rooms": 5.0}, "title": "\u201cTruly is \" Jewel of the Upper Wets Side\"\u201d", "text": "Stayed in a king suite for 11 nights and yes it cots us a bit but we were happy with the standard of room, the location and the friendliness of the staff. Our room was on the 20th floor overlooking Broadway and the madhouse of the Fairway Market. Room was quite with no noise evident from the hallway or adjoining rooms. It was great to be able to open windows when we craved fresh rather than heated air. The beds, including the fold out sofa bed, were comfortable and the rooms were cleaned well. Wi-fi access worked like a dream with only one connectivity issue on our first night and this was promptly responded to with a call from the service provider to ensure that all was well. The location close to the 72nd Street subway station is great and the complimentary umbrellas on the drizzly days were greatly appreciated. It is fabulous to have the kitchen with cooking facilities and the access to a whole range of fresh foods directly across the road at Fairway.\nThis is the second time that members of the party have stayed at the Beacon and it will certainly be our hotel of choice for future visits.", "author": {"username": "Papa_Panda", "num_cities": 22, "num_helpful_votes": 12, "num_reviews": 29, "num_type_reviews": 24, "id": "8C0B42FF3C0FA366A21CFD785302A032", "location": "Gold Coast"}, "date_stayed": "December 2012", "offering_id": 93338, "num_helpful_votes": 0, "date": "December 17, 2012", "id": 147643103, "via_mobile": false}

URL

http://www.cs.cmu.edu/~jiweil/html/hotel-review.html

http://www.cs.cmu.edu/~jiweil/html/four_city.html

Notes

  • Some of the reviews are written in French.
  • If a rating is not available, its field is omitted (unlike the PrefLib and Latent Aspect Rating Analysis datasets, which set a missing rating to -1).
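
Because missing ratings are simply absent, code that reads this dataset should not index the ratings dictionary directly. The following is a minimal sketch, assuming each review is stored as one JSON object per line (the layout of the downloaded file is not described here, so adjust accordingly). The function name parse_review and the abridged sample record are illustrative only.

```python
import json

# Aspect names taken from the Structure list above.
ASPECTS = ("overall", "cleanliness", "location", "rooms",
           "service", "sleep_quality", "value")


def parse_review(line):
    """Parse one JSON review and return (review, ratings).

    The returned ratings dict has a key for every aspect; aspects that
    are absent from the record are mapped to None.
    """
    review = json.loads(line)
    ratings = review.get("ratings", {})
    # dict.get returns None for a missing aspect instead of raising KeyError.
    full = {aspect: ratings.get(aspect) for aspect in ASPECTS}
    return review, full


# Abridged, made-up record in the same shape as the snippet above.
sample = ('{"ratings": {"overall": 5.0, "value": 4.0}, "title": "Nice", '
          '"text": "Good stay.", "offering_id": 93338, "id": 147643103, '
          '"via_mobile": false}')
review, ratings = parse_review(sample)
print(ratings["overall"], ratings["rooms"])
```

Mapping absent aspects to None keeps them distinguishable from a real rating, which a numeric placeholder would not.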

OpinRank Dataset

This dataset contains full reviews of hotels in 10 different cities (Dubai, Beijing, London, New York City, New Delhi, San Francisco, Shanghai, Montreal, Las Vegas, Chicago). There are about 80-700 hotels in each city. The extracted fields include the date, the review title and the full review. The total number of reviews is around 259,000.

Context

This data set is used in a research project that aims to extract entities (features) from reviews and rank those entities according to the user's preferences.

Number of hotels

2579

Number of reviews

~259,000

Format

CSV (Tab separated)

Structure

In the data folder, there are 10 sub-folders, one for each of the 10 cities mentioned earlier. Each file within these folders contains all the reviews for a particular hotel, and the filename is the name of the hotel. You can ignore the csv files in the data folder. Within each file, the reviews have the following format:

Date \t Title \t Review

Each line in the file represents a separate review entry. Tabs are used to separate the different fields.
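
As a minimal sketch of reading this format, the function below splits one line into its three fields. The name parse_opinrank_line and the shortened sample line are illustrative, and the sketch assumes the date and title never contain a tab.

```python
def parse_opinrank_line(line):
    """Split one review line into its date, title, and review text."""
    # maxsplit=2 keeps any stray tab inside the review text intact.
    parts = line.rstrip("\n").split("\t", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    date, title, review = (part.strip() for part in parts)
    return {"date": date, "title": title, "review": review}


# Shortened sample line; real reviews are much longer.
sample = "Nov 1 2009\tPerfect location\tWe finished our honeymoon in London.\n"
record = parse_opinrank_line(sample)
print(record["date"], "|", record["title"])
```

Raising on malformed lines, rather than silently skipping them, makes it easier to spot files that do not follow the expected layout.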

Snippet

Nov 1 2009 Perfect location We finished our honeymoon in London and, being the third Hilton we stayed in during 12 days, I must say that it was the lowest point during our "Hilton Honeymoon". That doesn't mean that the hotel was not good. It was great for other reasons. I'm only saying that the others were better. Now, the good things: Perfect location. 5 blocks away from Paddington Station (priceless if you arrive to or depart from Heathrow since you can catch the Heathrow Express/Connect). You have 3 underground lines in front of the hotel (Edgware Road Station) and it's only 10-minute walk from Hyde Park/Marble Arch/Oxford Street. The room was small but ok, with a great king size bed. The room was in the 10th floor of the Wing Tower so the view was nice. In our case, price was another good thing. We found the hotel through Hotwire and 100 USD per night for a Hilton in London sounded like a bargain. It was definitely a good value for money. We didn't have breakfast included (although complimentary tea/coffee is offered in the room), but we had a Marks & Spencer right in front of the hotel and a Tesco round the corner. Considering that the hotel was full, it wasn't noisy at all. Overall, it was very good at met our expectations.

URL

http://www.kavita-ganesan.com/entity-ranking-data

http://www.kavita-ganesan.com/opinion-based-entity-ranking

http://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset

Notes

  • It contains no ratings.

PrefLib Dataset

This dataset contains 675,069 reviews of 1,851 hotels across the world scraped from TripAdvisor.

One file contains the numerical aspect ratings provided by the users, along with other information about the hotel. The second file contains the text of the users' reviews. These reviews have been slightly modified: all excess spaces and tabs have been removed, and all commas have been changed to semi-colons.

Both files are zipped due to their size. Both files are encoded in the dat format and the first line of each file explains the fields within the file. Some of the usernames are encoded in Unicode.

Context

None. The dataset is used in a couple of papers, but not by the person who collected it.

Number of hotels

1,851

Number of reviews

675,069

Format

CSV (Comma separated)

Structure

The data is spread across two files: one contains the ratings and other short information, and the other contains the text reviews. The two files can be joined using the hotel ID and the user ID.

Fields in file 1

  • Hotel ID
  • User ID
  • Price
  • Location
  • Overall Rating
  • Value Rating
  • Rooms Rating
  • Location Rating
  • Cleanliness Rating
  • Front Desk Rating
  • Service Rating
  • Business Service Rating

Fields in file 2

  • Hotel ID
  • User ID
  • Date
  • Review Text
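
The join on hotel ID and user ID can be sketched with the standard csv module. This is only an illustration under stated assumptions: the first line of each unzipped file is taken to be a header row with the field names listed above, join_reviews is a made-up name, and the sample rows are shortened. If a user reviewed the same hotel more than once, this simple version keeps only the last review text for that pair.

```python
import csv
import io


def join_reviews(ratings_file, text_file):
    """Join rating rows with review rows on (Hotel ID, User ID)."""
    texts = {}
    # skipinitialspace tolerates a blank after a comma in the header row.
    for row in csv.DictReader(text_file, skipinitialspace=True):
        texts[(row["Hotel ID"], row["User ID"])] = row
    joined = []
    for row in csv.DictReader(ratings_file, skipinitialspace=True):
        match = texts.get((row["Hotel ID"], row["User ID"]))
        if match is not None:
            joined.append({**row,
                           "Date": match["Date"],
                           "Review Text": match["Review Text"]})
    return joined


# In-memory stand-ins for the two files, with shortened sample rows.
ratings_sample = io.StringIO(
    "Hotel ID,User ID,Price,Location,Overall Rating,Value Rating,"
    "Rooms Rating,Location Rating,Cleanliness Rating,Front Desk Rating,"
    "Service Rating,Business Service Rating\n"
    "100504,selizabethm,302,Seattle Washington,5,4,5,5,5,5,5,-1\n"
)
text_sample = io.StringIO(
    "Hotel ID, User ID,Date,Review Text\n"
    "100504,selizabethm,12/23/2008,Wonderful time- even with the snow!\n"
)
joined = join_reviews(ratings_sample, text_sample)
print(joined[0]["Overall Rating"], joined[0]["Date"])
```

All values come back as strings, so ratings need an explicit numeric conversion before any arithmetic.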

Snippet

File 1

Hotel ID,User ID,Price,Location,Overall Rating,Value Rating,Rooms Rating,Location Rating,Cleanliness Rating,Front Desk Rating,Service Rating,Business Service Rating
100504,selizabethm,302,Seattle Washington,5,4,5,5,5,5,5,-1

File 2

Hotel ID, User ID,Date,Review Text
100504,selizabethm,12/23/2008,Wonderful time- even with the snow! What a great experience! From the goldfish in the room (which my daughter loved) to the fact that the valet parking staff who put on my chains on for me it was fabulous. The staff was attentive and went above and beyond to make our stay enjoyable. Oh; and about the parking: the charge is about what you would pay at any garage or lot- and I bet they wouldn't help you out in the snow!

URL

http://www.preflib.org/combinatorial/trip.php

Notes

  • When ratings are unavailable, they take the value of -1.

Latent Aspect Rating Analysis Dataset

This dataset includes information about Author, Content, Date, Number of Reader, Number of Helpful Judgment, Overall rating, Value aspect rating, Rooms aspect rating, Location aspect rating, Cleanliness aspect rating, Check in/front desk aspect rating, Service aspect rating and Business Service aspect rating. Ratings range from 0 to 5 stars, and -1 indicates that the aspect rating is missing in the original html file.

Context

None.

Number of hotels

1,851

Number of reviews

Unknown. Since the number of hotels is the same as in the PrefLib dataset, it is likely that this is the same dataset; hence, it could have 675,069 reviews.

Format

~XML (Semi-XML, doesn't close tags)

Structure

The data is spread across 1,851 files, one per hotel. Each file contains all the reviews for that hotel.

The header of each file contains the overall rating, the average price and the URL of the hotel. It seems that the information about the location is not available.

The available information is:

  • Author
  • Content
  • Date
  • img
  • No. Reader
  • No. Helpful
  • Overall
  • Value
  • Rooms
  • Location
  • Cleanliness
  • Check in / front desk
  • Service
  • Business Service

Snippet

<Author>Mircom
<Content>Excellent Beach - Hotel construction woes An otherwise excellent holiday was initially marred by construction glitches. Upon arrival, we were asked to wait until our room was ready. We were invited to dine at the buffet while waiting only to be told by the staff at the restaurant that the buffet was closed. After speaking with the restaurant manager, we were escorted to the Italian restaurant where we enjoyed a delicious meal. We returned to the Front Desk and Rafael Norburto Aquino, a bellman for the Hotel, took us to our room. The one bedroom suite was gorgeous, however, both the light fixtures in the shower and toilet areas were dangling from the ceiling. There was a box of electrical bits on the floor. Rafael very professionally moved into action and spoke with several people to have the wiring completed. We were impressed with his professionalism and fluency in English which was a life saver because we unfortunately speak little Spanish. He also arranged to have a second bed delivered. We had requested this when booking the hotel, but there appeared to be no record of our request. After a few hours, the suite was finally available for our use. We took a shower only to find that there was a cap on the drain and the room flooded because of the lack of proper drainage. We called the Front Desk three times before someone finally called maintenance. One hour later, a maintenace person came and using a screwdriver, broke a hole in the cap to allow the water to drain. It took another 30 minutes to have someone mop up the excess water. Apparently we were the first guests to use the suite. The wait staff in the restaurants are excellent as is the food. We recommend the Italian and Argentinian restaurants -- and the buffet has an excellent variety of food. In the late afternoon, we would listen to a great piano player who also performs in the French restaurant.We were also impressed with Olmaira, the Manager of Guest Services. 
We did speak with her regarding our challenges with the hotel and the professionalism of Rafael.Once the construction glitches have been remedied, this will be an excellent hotel.
<Date>Dec 30, 2008
<No. Reader>23
<No. Helpful>23
<Overall>4
<Value>4
<Rooms>2
<Location>5
<Cleanliness>4
<Check in / front desk>2
<Service>4
<Business service>-1

URL

http://sifaka.cs.uiuc.edu/~wang296/Data/index.html

Notes

  • When ratings are unavailable, they take the value of -1.
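
Since the tags are never closed, a standard XML parser will not accept these files. The sketch below reads them line by line instead. It assumes that every field starts on a new line with its tag, that each review begins with an Author tag, and that a line without a leading tag continues the previous field; the names parse_reviews and RATING_TAGS and the shortened sample are illustrative. Tag names are lower-cased, and a rating of -1 is converted to None.

```python
import re

# A field line looks like "<Tag>value"; tags are never closed.
TAG_LINE = re.compile(r"^<([^<>]+)>(.*)$")
RATING_TAGS = {"overall", "value", "rooms", "location", "cleanliness",
               "check in / front desk", "service", "business service"}


def parse_reviews(lines):
    """Group tagged lines into one dict per review, keyed by lower-cased tag."""
    reviews, current, last_tag = [], None, None
    for raw in lines:
        line = raw.rstrip("\n")
        match = TAG_LINE.match(line)
        if match:
            tag = match.group(1).strip().lower()
            if tag == "author":  # assumption: Author opens each review
                current = {}
                reviews.append(current)
            if current is None:
                continue  # file header lines before the first review
            current[tag] = match.group(2).strip()
            last_tag = tag
        elif current is not None and last_tag and line.strip():
            # An untagged line continues the previous field (e.g. Content).
            current[last_tag] += " " + line.strip()
    for review in reviews:
        for tag in RATING_TAGS:
            if tag in review:
                number = float(review[tag])
                review[tag] = None if number == -1 else number
    return reviews


# Shortened sample in the same shape as the snippet above.
sample = """<Author>Mircom
<Content>Excellent Beach - Hotel construction woes.
We did speak with her regarding our challenges.
<Date>Dec 30, 2008
<Overall>4
<Check in / front desk>2
<Business service>-1
"""
reviews = parse_reviews(sample.splitlines())
print(reviews[0]["overall"], reviews[0]["business service"])
```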