bf_qt_scraping

This notebook shows how to scrape hotel data with PyQt, using its WebKit module to render each page and lxml to parse the result.

The items we want to extract are:

the hotels for a given city
the link to each hotel page
the hotel summary text
the hotel description text

Once the link to each hotel page is known, I want to extract the following items for each review:

title
author
text
rating



In [1]:

    
import sys
# Import only the Qt classes that are used below
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
from lxml import html



In [2]:

    
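class Render(QWebPage):
    """Load a URL in a WebKit page and block until loading has finished."""

    def __init__(self, url):
        # Reuse an existing QApplication if there is one; only one may exist
        self.app = QApplication.instance() or QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # run the event loop until _loadFinished quits it

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

    def update_url(self, url):
        """Load a new URL in the same page and block until it has loaded."""
        self.mainFrame().load(QUrl(url))
        self.app.exec_()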



In [68]:

    
url = 'http://www.bringfido.com/lodging/city/new_haven_ct_us'
# Load and render the page; this blocks until loading has finished
r = Render(url)
# toHtml() returns the rendered page as a QString
result = r.frame.toHtml()



In [65]:

    
# result



In [ ]:

    
# Convert the QString to a Python string before passing it to lxml
formatted_result = str(result.toAscii())



In [16]:

    
# Next, build an lxml tree from formatted_result
tree = html.fromstring(formatted_result)



In [24]:

    
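# Without parentheses this evaluates to the bound method itself;
# call tree.text_content() to get the text of the page
tree.text_content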









    Out[24]:





<bound method HtmlElement.text_content of <Element html at 0x10c181c00>>



In [19]:

    
# Use XPath to select the hotel result blocks
archive_links = tree.xpath('//*[@id="results_list"]/div')
print(archive_links)

[]



In [2]:

    
url = 'http://pycoders.com/archive/'  
r = Render(url)  
result = r.frame.toHtml()

# Convert the QString to a Python string before passing it to lxml
formatted_result = str(result.toAscii())

tree = html.fromstring(formatted_result)



In [5]:

    
# Use XPath to fetch the URLs of the archived issues
archive_links = tree.xpath('//*[@class="campaign"]/a/@href')

# for lnk in archive_links:
#     print(lnk)

Now the Hotels



In [3]:

    
url = 'http://www.bringfido.com/lodging/city/new_haven_ct_us'  
r = Render(url)  
result = r.frame.toHtml()

# Convert the QString to a Python string before passing it to lxml
formatted_result = str(result.toAscii())

tree = html.fromstring(formatted_result)



In [4]:

    
# Use XPath to select one div per hotel in the results list
# (the variable name is carried over from the archive example)
archive_links = tree.xpath('//*[@id="results_list"]/div')

print(archive_links)
print('')

for lnk in archive_links:
    print(lnk.xpath('div[2]/h1/a/text()')[0])
    print(lnk.text_content())
    print('*'*25)









    



[<Element div at 0x10adf9788>, <Element div at 0x10adf97e0>, <Element div at 0x10adf9838>, <Element div at 0x10adf9890>, <Element div at 0x10adf98e8>]

La Quinta Inn & Suites New Haven
La Quinta Inn & Suites New HavenNew Haven, CT, USLa Quinta Inn & Suites New Haven is pet friendly! Up to two pets of any size are allowed in each room for no additional fee or deposit.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$75 (no pet fee)VISIT WEBSITE
*************************
Omni New Haven Hotel at Yale
Omni New Haven Hotel at YaleNew Haven, CT, USOmni New Haven Hotel At Yale welcomes a maximum of two dogs, 25lbs or less, per guest room for an additioanl $50 per stay. Dogs over 25lbs require prior approval from the manager. Please note that ...Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$219 + pet feeCHECK RATES
*************************
Premiere Hotel & Suites
Premiere Hotel & SuitesNew Haven, CT, USPremiere Hotel And Suites allows up to two dogs (50 lbs or less) per guest room for an additional fee of $75 per stay. Larger dogs may be permitted with prior management approval.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$110 + pet feeCHECK RATES
*************************
Econo Lodge Conference Center New Haven
Econo Lodge Conference Center New HavenNew Haven, CT, USEcono Lodge Conference Center welcomes up to two pets, 25lbs or less, in a limited number of pet-friendly rooms, for an additional $10 per pet, per night.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$56 + pet feeCHECK RATES
*************************
The Study at Yale
The Study at YaleNew Haven, CT, USThe Study At Yale allows up to two dogs (50 lbs or less) for an additional fee of $50 per pet per stay.Hotel Overview | Map | Photos | Guest ReviewsLow Rates from$189 + pet feeCHECK RATES
*************************
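The listing text printed above ends with the advertised rate and a pet-fee note. As a minimal sketch, assuming every listing follows the "Low Rates from$..." wording seen in this output, the rate could be pulled out with a regular expression. The names `RATE_PATTERN` and `parse_rate` are made up for this illustration, and treating the absence of "(no pet fee)" as "a pet fee applies" is an assumption.

```python
import re

# Tails of two of the listing texts printed above
listing_tails = [
    'Low Rates from$75 (no pet fee)VISIT WEBSITE',
    'Low Rates from$189 + pet feeCHECK RATES',
]

# Hypothetical pattern: a dollar amount right after "Low Rates from"
RATE_PATTERN = re.compile(r'Low Rates from\$(\d+(?:\.\d+)?)')


def parse_rate(text):
    """Return (rate, has_pet_fee), or None if no rate is found."""
    match = RATE_PATTERN.search(text)
    if match is None:
        return None
    rate = float(match.group(1))
    # Assumption: a listing charges a pet fee unless it says "(no pet fee)"
    has_pet_fee = '(no pet fee)' not in text
    return rate, has_pet_fee


for tail in listing_tails:
    print(parse_rate(tail))
```

Returning `None` for text without a rate keeps the helper usable on listings that do not match the assumed wording.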

Now Get the Links



In [5]:

    
links = []
for lnk in archive_links:
    # Evaluate the XPath once and reuse the result
    href = lnk.xpath('div/h1/a/@href')[0]
    print(href)
    links.append(href)
    print('*'*25)









    



/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01
*************************
/lodging/70451/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=219
*************************
/lodging/70452/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=109.65
*************************
/lodging/70447/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=55.95
*************************
/lodging/106805/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=189
*************************



In [6]:

    
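# An XPath that starts with '//' searches the whole document, not just lnk,
# so this returns the first hotel's link even though lnk is the last hotel
lnk.xpath('//*/div/h1/a/@href')[0]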









    Out[6]:





'/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01'



In [7]:

    
links









    Out[7]:





['/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01',
 '/lodging/70451/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=219',
 '/lodging/70452/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=109.65',
 '/lodging/70447/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=55.95',
 '/lodging/106805/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=189']
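Each link carries a numeric segment in its path and a query string. The sketch below shows one way to take a link apart with the standard library. Reading the path segment as a hotel ID is an assumption based on the links above, and the meaning of parameters such as `rt` is not documented here, so they are only printed, not interpreted.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2, as in this notebook

link = '/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01'

parts = urlparse(link)
# Assumption: the last non-empty path segment identifies the hotel
hotel_id = parts.path.strip('/').split('/')[-1]
# keep_blank_values=True keeps empty parameters such as 'ar' and 'dt'
params = parse_qs(parts.query, keep_blank_values=True)

print(hotel_id)
print(params['rt'][0])
```

An ID like this could serve as a key when storing the reviews scraped for each hotel.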

Loading Reviews

Next, we want to step through each page, and scrape the reviews for each hotel.
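The collected links are relative, so they need the site's base URL before they can be loaded. A minimal sketch of that step, using two of the links from the list above; the name `hotel_urls` is introduced only for this example:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in this notebook

url_base = 'http://www.bringfido.com'
links = [
    '/lodging/70449/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=75.01',
    '/lodging/70451/?cid=14745&ar=&dt=&rm=1&ad=1&ch=0&dg=1&rt=219',
]

# urljoin resolves each relative link against the base URL
hotel_urls = [urljoin(url_base, link) for link in links]

for hotel_url in hotel_urls:
    print(hotel_url)
```

Each resulting URL could then be passed to `r.update_url` in a loop. `urljoin` also copes with a trailing slash on the base, which plain string concatenation does not.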



In [8]:

    
url_base = 'http://www.bringfido.com'  
r.update_url(url_base+links[0])  
result = r.frame.toHtml()

# Convert the QString to a Python string before passing it to lxml
formatted_result = str(result.toAscii())

tree = html.fromstring(formatted_result)



In [16]:

    
hotel_description = tree.xpath('//*[@class="body"]/text()')

details = tree.xpath('//*[@class="address"]/text()')

address = details[0]
csczip = details[1]
phone = details[2]

# Select the container element of each review
reviews = tree.xpath('//*[@class="review_container"]')

texts = []
titles = []
authors = []
ratings = []

print(reviews)
print('')
for rev in reviews:
    titles.append(rev.xpath('div/div[1]/text()')[0])
    authors.append(rev.xpath('div/div[2]/text()')[0])
    texts.append(rev.xpath('div/div[3]/text()')[0])
    # The rating is the first character of the rating image's file name
    rating = rev.xpath('div[2]/img/@src')[0].split('/')[-1][0:1]
    ratings.append(rating)
    print(rating)









    



[<Element div at 0x10d9ecdb8>, <Element div at 0x10d9ece10>]

5
3



In [17]:

    
titles









    Out[17]:





['Great value and no pet fee', 'Getting old']



In [18]:

    
authors









    Out[18]:





['\nErin\nin Washington, DC\n', '\nLawrence\nin Pittsfield, MA\n']
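The author strings come back with embedded newlines and combine the reviewer's name with a location. A small sketch of one way to separate the two, assuming every entry has the "name, then 'in <place>'" layout seen in this output; `split_author` is a hypothetical helper, not part of the notebook:

```python
def split_author(raw):
    """Split a raw author string into (name, location)."""
    # Drop empty lines and surrounding whitespace
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    name = lines[0] if lines else ''
    location = lines[1] if len(lines) > 1 else ''
    # Assumption: the location line is prefixed with "in "
    if location.startswith('in '):
        location = location[3:]
    return name, location


authors = ['\nErin\nin Washington, DC\n', '\nLawrence\nin Pittsfield, MA\n']
cleaned = [split_author(author) for author in authors]
print(cleaned)
```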



In [19]:

    
texts









    Out[19]:





['My 75lb dog and I received a lovely welcome from two young gentlemen working at the desk the evening of 8/1/14. Check-in was very easy and even though this hotel is close to the highway and Ikea, there is a big patch of grass/bushes out front for dogs to relieve themselves --I didn\'t see bags so bring your own. Yes, I agree with review that it is an older place but my room with one queen bed was plenty clean and comfortable. When I travel with my dog, the less fancy the better. :). Also, it made it nice that there was a dog friendly restaurant less than 10 minute drive away...see review for "Basta". I definitely feel good about spending just over $100 at this hotel which was just a stop along the way in my travels. Recommend to other dog owners. (P.s. This site says small pets only but that is not up to date. The policy attached on this site shows that they take any size). ',
 'I stayed at this La Quinta and felt that it was getting up there in age and a little run down. It was decently clean, although not fully clean. The staff were fairly helpful.']



In [64]:

    
ratings









    Out[64]:





['5', '3']
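The four parallel lists are easier to work with once they are combined into one record per review. A minimal sketch, using the values shown in the outputs above and leaving out `texts` to keep the example short; the record layout and the name `records` are choices made for this illustration:

```python
titles = ['Great value and no pet fee', 'Getting old']
authors = ['\nErin\nin Washington, DC\n', '\nLawrence\nin Pittsfield, MA\n']
ratings = ['5', '3']

records = []
for title, author, rating in zip(titles, authors, ratings):
    records.append({
        'title': title,
        # Collapse newlines and repeated whitespace into single spaces
        'author': ' '.join(author.split()),
        # The scraped rating is a one-character string; store it as an int
        'rating': int(rating),
    })

# float() keeps the division exact under Python 2 as well
mean_rating = sum(rec['rating'] for rec in records) / float(len(records))

print(records)
print(mean_rating)
```

Storing the rating as an integer makes it possible to aggregate across reviews, as the mean here does.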



In [ ]: