bf_qt_scraping

This notebook shows how to scrape hotel data from bringfido.com, using PyQt4's QtWebKit to render each page and lxml to parse the rendered HTML.

The items we want to extract are:

  • the hotels for a given city
  • links to each hotel page
  • the hotel summary text
  • the hotel description text

Once the links for each hotel are determined, we then want to extract the following items for each review:

  • title
  • author
  • text
  • rating

In [1]:
import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html

In [2]:
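class Render(QWebPage):
    """Load a URL in a QtWebKit page (no window shown) and keep the loaded frame."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Block in the Qt event loop until _loadFinished quits it.
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the loaded frame and leave the event loop.
        self.frame = self.mainFrame()
        self.app.quit()

    def update_url(self, url):
        # Load another URL with the same page and application.
        self.mainFrame().load(QUrl(url))
        self.app.exec_()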

In [68]:
url = 'http://www.bringfido.com/lodging/city/new_haven_ct_us'  
# Render the page; this call blocks until loading has finished.
r = Render(url)
# The result is a QString.
result = r.frame.toHtml()

In [65]:
# result

In [ ]:
# Convert the QString to a Python str before passing it to lxml.
formatted_result = str(result.toAscii())

In [16]:
# Next, build an lxml tree from formatted_result.
tree = html.fromstring(formatted_result)

In [24]:
tree.text_content


Out[24]:
<bound method HtmlElement.text_content of <Element html at 0x10c181c00>>

In [19]:
# Select the hotel entries inside the results list.
archive_links = tree.xpath('//*[@id="results_list"]/div')
print(archive_links)


[]

In [2]:
url = 'http://pycoders.com/archive/'  
r = Render(url)  
result = r.frame.toHtml()

# Convert the QString to a Python str before passing it to lxml.
formatted_result = str(result.toAscii())

tree = html.fromstring(formatted_result)

In [5]:
# Use an XPath expression to fetch the URLs of the archive pages.
archive_links = tree.xpath('//*[@class="campaign"]/a/@href')

# for lnk in archive_links:
#     print(lnk)

Now the Hotels


In [3]:
url = 'http://www.bringfido.com/lodging/city/new_haven_ct_us'  
r = Render(url)  
result = r.frame.toHtml()

# Convert the QString to a Python str before passing it to lxml.
formatted_result = str(result.toAscii())

tree = html.fromstring(formatted_result)

In [4]:
# Select the hotel entries inside the results list.
archive_links = tree.xpath('//*[@id="results_list"]/div')

print(archive_links)
print('')

for lnk in archive_links:
    # Print the hotel name, then the full text of the entry.
    print(lnk.xpath('div[2]/h1/a/text()')[0])
    print(lnk.text_content())
    print('*'*25)


[<Element div at 0x10adf9788>, <Element div at 0x10adf97e0>, <Element div at 0x10adf9838>, <Element div at 0x10adf9890>, <Element div at 0x10adf98e8>]

La Quinta Inn & Suites New Haven
La Quinta Inn & Suites New HavenNew Haven, CT, USLa Quinta Inn & Suites New Haven is pet friendly! Up to two pets of any size are allowed in each room for no additional fee or deposit.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$75 (no pet fee)VISIT WEBSITE
*************************
Omni New Haven Hotel at Yale
Omni New Haven Hotel at YaleNew Haven, CT, USOmni New Haven Hotel At Yale welcomes a maximum of two dogs, 25lbs or less, per guest room for an additioanl $50 per stay. Dogs over 25lbs require prior approval from the manager. Please note that ...Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$219 + pet feeCHECK RATES
*************************
Premiere Hotel & Suites
Premiere Hotel & SuitesNew Haven, CT, USPremiere Hotel And Suites allows up to two dogs (50 lbs or less) per guest room for an additional fee of $75 per stay. Larger dogs may be permitted with prior management approval.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$110 + pet feeCHECK RATES
*************************
Econo Lodge Conference Center New Haven
Econo Lodge Conference Center New HavenNew Haven, CT, USEcono Lodge Conference Center welcomes up to two pets, 25lbs or less, in a limited number of pet-friendly rooms, for an additional $10 per pet, per night.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$56 + pet feeCHECK RATES
*************************
The Study at Yale
The Study at YaleNew Haven, CT, USThe Study At Yale allows up to two dogs (50 lbs or less) for an additional fee of $50 per pet per stay.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$189 + pet feeCHECK RATES
*************************

In [5]:
links = []
for lnk in archive_links:
    # Relative link to the hotel's own page.
    href = lnk.xpath('div/h1/a/@href')[0]
    print(href)
    links.append(href)
    print('*'*25)


/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01
*************************
/lodging/70451/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=219
*************************
/lodging/70452/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=109.65
*************************
/lodging/70447/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=55.95
*************************
/lodging/106805/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=189
*************************

In [6]:
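# A leading // searches the whole document, not just lnk,
# so this returns the first hotel's link.
lnk.xpath('//*/div/h1/a/@href')[0]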


Out[6]:
'/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01'
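
The difference between an absolute and a relative XPath on an lxml element can be seen in a small, self-contained sketch. The HTML below is a made-up two-entry page that only mimics the nesting used above; it is not the real bringfido.com markup.

```python
from lxml import html

# Hypothetical page with two entries, for illustration only.
PAGE = """
<html><body>
  <div id="results_list">
    <div><div><h1><a href="/lodging/1/">First</a></h1></div></div>
    <div><div><h1><a href="/lodging/2/">Second</a></h1></div></div>
  </div>
</body></html>
"""

doc = html.fromstring(PAGE)
entries = doc.xpath('//*[@id="results_list"]/div')
last = entries[-1]

# Absolute path: evaluated against the whole document.
from_root = last.xpath('//*/div/h1/a/@href')[0]
# Relative path: evaluated against `last` only.
from_entry = last.xpath('div/h1/a/@href')[0]

print(from_root)   # /lodging/1/
print(from_entry)  # /lodging/2/
```

Inside a loop over entries, a relative path (or one starting with `.//`) is therefore the form that stays within the current entry.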

In [7]:
links


Out[7]:
['/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01',
 '/lodging/70451/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=219',
 '/lodging/70452/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=109.65',
 '/lodging/70447/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=55.95',
 '/lodging/106805/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=189']

Loading Reviews

Next, we step through each hotel page and scrape its reviews.
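
The scraped links are relative, so each one has to be combined with the site root before it can be loaded. A minimal sketch of that step, assuming the standard library's urljoin is acceptable in place of string concatenation; the helper name absolute_urls and the constant URL_BASE are introduced here for illustration only.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Site root, taken from the listing URL used in this notebook.
URL_BASE = 'http://www.bringfido.com'

def absolute_urls(relative_links, base=URL_BASE):
    """Return the absolute URL for each relative hotel link."""
    return [urljoin(base, link) for link in relative_links]

# One of the links collected above.
sample = ['/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01']
print(absolute_urls(sample)[0])
```

urljoin leaves the query string untouched and copes with a base that does or does not end in a slash, which plain concatenation does not.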


In [8]:
url_base = 'http://www.bringfido.com'  
r.update_url(url_base+links[0])  
result = r.frame.toHtml()

# Convert the QString to a Python str before passing it to lxml.
formatted_result = str(result.toAscii())

tree = html.fromstring(formatted_result)

In [16]:
hotel_description = tree.xpath('//*[@class="body"]/text()')

# The address block holds three text nodes: street address,
# city/state/zip, and phone number.
details = tree.xpath('//*[@class="address"]/text()')

address = details[0]
csczip = details[1]
phone = details[2]

# Select the container element of each review.
reviews = tree.xpath('//*[@class="review_container"]')

texts = []
titles = []
authors = []
ratings = []

print(reviews)
print('')
for rev in reviews:
    titles.append(rev.xpath('div/div[1]/text()')[0])
    authors.append(rev.xpath('div/div[2]/text()')[0])
    texts.append(rev.xpath('div/div[3]/text()')[0])
    # The rating is the first character of the rating image's file name.
    rating = rev.xpath('div[2]/img/@src')[0].split('/')[-1][0:1]
    ratings.append(rating)
    print(rating)


[<Element div at 0x10d9ecdb8>, <Element div at 0x10d9ece10>]

5
3
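
The rating expression above is dense, so here is the same logic as a small helper. This is only a sketch: the function name rating_from_src is introduced for illustration, and the example path is made up, since the real src values are not shown in this notebook.

```python
def rating_from_src(src):
    """Return the first character of the image file name as the rating."""
    filename = src.split('/')[-1]  # last path segment
    return filename[0:1]           # slicing avoids an IndexError on ''

# Made-up path for illustration only; not taken from the site.
example_src = '/static/images/5_stars.png'
print(rating_from_src(example_src))  # 5
```

The result is still a string; converting it with int() would fail if the file name ever started with something other than a digit.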

In [17]:
titles


Out[17]:
['Great value and no pet fee', 'Getting old']

In [18]:
authors


Out[18]:
['\nErin\nin Washington, DC\n', '\nLawrence\nin Pittsfield, MA\n']
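
The raw author strings pack the reviewer's name and location into one value separated by newlines. One possible way to split them is sketched below, assuming every entry follows the "name, newline, in location" layout of the two shown above; split_author is a name introduced here for illustration.

```python
def split_author(raw):
    """Split '\\nName\\nin City, ST\\n' into a (name, location) tuple."""
    lines = [line.strip() for line in raw.strip().split('\n')]
    name = lines[0]
    location = lines[1] if len(lines) > 1 else ''
    # Drop the leading "in " that precedes the location.
    if location.startswith('in '):
        location = location[3:]
    return name, location

print(split_author('\nErin\nin Washington, DC\n'))
```

Entries without a second line come back with an empty location rather than raising an error.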

In [19]:
texts


Out[19]:
['My 75lb dog and I received a lovely welcome from two young gentlemen working at the desk the evening of 8/1/14. Check-in was very easy and even though this hotel is close to the highway and Ikea, there is a big patch of grass/bushes out front for dogs to relieve themselves --I didn\'t see bags so bring your own. Yes, I agree with review that it is an older place but my room with one queen bed was plenty clean and comfortable. When I travel with my dog, the less fancy the better. :). Also, it made it nice that there was a dog friendly restaurant less than 10 minute drive away...see review for "Basta". I definitely feel good about spending just over $100 at this hotel which was just a stop along the way in my travels. Recommend to other dog owners. (P.s. This site says small pets only but that is not up to date. The policy attached on this site shows that they take any size). ',
 'I stayed at this La Quinta and felt that it was getting up there in age and a little run down. It was decently clean, although not fully clean. The staff were fairly helpful.']

In [64]:
ratings


Out[64]:
['5', '3']
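
The four parallel lists can be combined into one record per review, which is easier to store or pass along. A rough sketch, assuming the lists always have the same length and every rating is a digit; build_reviews is a hypothetical helper, and the review texts are replaced by placeholders to keep the example short.

```python
def build_reviews(titles, authors, texts, ratings):
    """Zip the four parallel lists into one dict per review."""
    return [
        {
            'title': title,
            'author': ' '.join(author.split()),  # collapse newlines and spaces
            'text': text.strip(),
            'rating': int(rating),               # assumes a digit string
        }
        for title, author, text, rating in zip(titles, authors, texts, ratings)
    ]

records = build_reviews(
    ['Great value and no pet fee', 'Getting old'],
    ['\nErin\nin Washington, DC\n', '\nLawrence\nin Pittsfield, MA\n'],
    ['(review text)', '(review text)'],  # placeholders, not the real texts
    ['5', '3'],
)
print(records[0]['author'], records[0]['rating'])
```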

In [ ]: