Use the requests module to make an HTTP request to http://www.tripadvisor.com
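A minimal sketch of this request, assuming the requests package is installed. The fetch helper and its timeout value are illustrative choices, not part of the original notebook. The request is first built without being sent, so the example can be inspected offline.

```python
import requests

URL = "http://www.tripadvisor.com"


def fetch(url, timeout=10):
    """Send a GET request and return the response, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    return response


# Build the request without sending it to see what would go over the wire.
prepared = requests.Request("GET", URL).prepare()
print(prepared.method, prepared.url)

# To actually fetch the page (requires network access):
# response = fetch(URL)
# print(response.status_code, len(response.text))
```

The HTML of the page is then available as response.text, which is the input to the parsing steps that follow.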
In [ ]:
In [ ]:
In [ ]:
Sometimes, you may want a little bit of information - a movie rating, stock price, or product availability - but the information is available only in HTML pages, surrounded by ads and extraneous content.
To do this, we build an automated web fetcher called a crawler or spider. After the HTML contents have been retrieved from the remote web servers, a scraper parses them to find the needle in the haystack.
The bs4 module can be used for searching a webpage (HTML file) and pulling required data from it. It does three things to make an HTML page searchable:
This module takes the HTML page and creates four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Read more about BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
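The four object kinds can be seen on a tiny document. This is only an illustrative sketch: the HTML string is made up for the example, and Python's built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A made-up snippet containing a tag, a text node, and a comment.
html = '<p class="note">Hello<!-- hidden remark --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                  # the <p> element is a Tag
text, comment = p.contents  # its children: a text node and a comment

for obj in (soup, p, text, comment):
    print(type(obj).__name__)
```

Note that Comment is a subclass of NavigableString, so an isinstance check against NavigableString also matches comments.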
In [ ]:
<h1 id="HEADING" property="name" class="heading_name ">
<div class="heading_height"></div>
Le Jardin Napolitain
</h1>
Let us write the code to parse an HTML page. We will use the TripAdvisor URL for an infamous restaurant - https://www.tripadvisor.com/Restaurant_Review-g187147-d1751525-Reviews-Cafe_Le_Dome-Paris_Ile_de_France.html
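A minimal parsing sketch, using the <h1> heading snippet shown earlier as a stand-in for the downloaded page so that it runs offline. It assumes the restaurant name sits in an <h1> with id "HEADING", as in that snippet; the live page markup may differ.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; copied from the heading snippet in this notebook.
SAMPLE_HTML = """
<h1 id="HEADING" property="name" class="heading_name ">
<div class="heading_height"></div>
Le Jardin Napolitain
</h1>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the heading by tag name and id, then pull out its text.
heading = soup.find("h1", id="HEADING")
name = heading.get_text(strip=True)
print(name)
```

With a real page, SAMPLE_HTML would be replaced by the text of the response returned by requests.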
In [ ]:
In [ ]:
<div class="entry">
<p class="partial_entry">
Popped in on way to Eiffel Tower for lunch, big mistake.
Pizza was disgusting and service was poor.
It’s a shame Trip Advisor don’t let you score venues zero....
<span class="taLnk ulBlueLinks" onclick="widgetEvCall('handlers.clickExpand',event,this);">More
</span>
</p>
</div>
Let us try to find all the <p> (paragraph) tags in the soup:
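One way this could look, sketched on a shortened copy of the review snippet above rather than on the live page. The extra unrelated paragraph is added only to show the difference between matching every <p> and matching by class.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the review markup shown above.
SAMPLE_HTML = """
<div class="entry">
<p class="partial_entry">
Popped in on way to Eiffel Tower for lunch, big mistake.
Pizza was disgusting and service was poor.
<span class="taLnk ulBlueLinks">More</span>
</p>
</div>
<p>An unrelated paragraph.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every <p> tag in the document.
paragraphs = soup.find_all("p")

# Only the <p> tags that carry the review text.
reviews = soup.find_all("p", class_="partial_entry")

review_texts = []
for p in reviews:
    # Remove the "More" expand link so it does not end up in the review text.
    for span in p.find_all("span"):
        span.decompose()
    # Collapse newlines and repeated spaces into single spaces.
    review_texts.append(" ".join(p.get_text().split()))

print(len(paragraphs), len(reviews))
print(review_texts[0])
```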
In [ ]:
In [ ]:
Using yesterday's sentiment analysis code and the corpus of sentiments in the word_sentiment.csv file, calculate the sentiment of the reviews.
In [ ]:
#Enter your code here
In [ ]:
Using the review data and the ratings available, is there any way we can improve the corpus of sentiments in the "word_sentiment.csv" file?
In [ ]:
Some websites make requests in the background to fetch data from the server and load it into the page dynamically (often an AJAX request). In this case, the URL of the page will not indicate the location of the data. To find such requests, open the Chrome or Firefox Developer Tools, load the page, go to the “Network” tab, and look through all of the requests being sent in the background to find the one that returns the data you’re looking for. Start by filtering the requests to only XHR or JS to make this easier.
Once you find the AJAX request that returns the data you’re hoping to scrape, you can make your scraper send requests to this URL instead of to the parent page’s URL. If you’re lucky, the response will be encoded as JSON, which is even easier to parse than HTML.
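A rough sketch of handling a JSON response. The fetch_json helper, the sample payload, and its field names ("restaurants", "name", "rating") are all hypothetical; a real endpoint will have its own structure, which you can inspect in the “Network” tab.

```python
import json

import requests


def fetch_json(url, params=None, timeout=10):
    """GET a background (AJAX) endpoint and decode its JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Hypothetical response body, used here so the example runs without a network.
SAMPLE_BODY = (
    '{"restaurants": ['
    '{"name": "Cafe A", "rating": 4.5}, '
    '{"name": "Cafe B", "rating": 3.0}]}'
)

# JSON decodes straight into dicts and lists; no HTML parsing is needed.
data = json.loads(SAMPLE_BODY)
names = [r["name"] for r in data["restaurants"]]
print(names)
```

Against a live endpoint, data would come from fetch_json(url) instead of json.loads(SAMPLE_BODY).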
In [ ]:
By default, the requests library sets the User-Agent header on each request to something like “python-requests/2.x.x”. You can change it to identify your web scraper, perhaps providing a contact email address so that an admin from the target website can reach out if they see you in their logs.
More commonly, this can be used to make it appear that the request is coming from a normal web browser, and not a web scraping program.
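The effect of the header can be checked without sending anything. In this sketch the scraper name, contact address, and example.com URL are placeholders; the requests are only prepared, not sent.

```python
import requests

# The User-Agent string requests uses when none is supplied.
print(requests.utils.default_user_agent())

session = requests.Session()
url = "https://example.com/"  # placeholder URL

# Prepared with the session defaults: carries the default User-Agent.
default_prepared = session.prepare_request(requests.Request("GET", url))

# Prepared with a custom header that identifies the scraper (placeholder values).
custom_headers = {"User-Agent": "my-scraper/0.1 (contact: admin@example.com)"}
custom_prepared = session.prepare_request(
    requests.Request("GET", url, headers=custom_headers)
)

print(default_prepared.headers["User-Agent"])
print(custom_prepared.headers["User-Agent"])
```

Passing the same dictionary as headers= to requests.get or requests.post applies it to a real request.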
In [ ]:
import requests

# Request headers that make the request look like it comes from a normal browser.
header = {
'cookie': 'TAUnique=%1%enc%3AHvAwOscAcmfzIwJbsS10GnXn4FrCUpCm%2Bnw21XKuzXoV7vSwMEnyTA%3D%3D; fbm_162729813767876=base_domain=.tripadvisor.com; TACds=B.3.11419.1.2019-03-31; TASSK=enc%3AABCGM1r6xBekOjRaaQZ3QVS7dP4cwZ8sombvPTq8xK6xN55i7TN8puwZdwvXvG1i%2FJ2UQXYG1CwsU%2BXLwLs5qIxnmW5qbLt4I48DfK5FhHpwUw3ZgrbskK%2FjDc4ENfcCXw%3D%3D; ServerPool=C; TART=%1%enc%3A8yMCW7EtdBqPX0oluvfOS5mBk6DRMHXwNEAPJlcpaDumiCWsxs%2BxfBbTYsxpa%2F9l%2FJzCllshf9g%3D; VRMCID=%1%V1*id.10568*llp.%2FRestaurant_Review-g187147-d3405673-Reviews-La_Terrasse_Vedettes_de_Paris-Paris_Ile_de_France%5C.html*e.1557691551614; PMC=V2*MS.36*MD.20190505*LD.20190506; PAC=ALNtqHPT2KJjQwExTPJt3gCvzvDYH_x63ZOT4b3LetvkHuHXcEUY4eLx0TqKGzOIpoXF3K_j57rNigUkWJzSv7TtTna4L3DKcfiaeK9zT9ixGEevH6QwZVd-PdMyr9y5aRzjEVAfid42zC4WXeTcQTJkPVwGMCW2mB2k3xxfB78GgJFIR_I9vf6Bzhq89x_UTTUcQgFpCr8GEFV9GpJWG8UNGeriJSbmPtCXA10oXl5ox7U9TQvSILLSH8PdrP8nwUQMRnfUA_fKbXTaRgH4tzBwZQpbd1vlOOg7fKyfIN9V95PzNOXBEQCJIo3z09Nux0tyZZVX0PX_zI_moLpr9Od3eSi1E8Hm5QcLyG9QNfA1C5WckG9GOV5VKEL0bxDY5TG1smCaQDXpRLkvp8w2bD7vyI2e27WFbtuYvJDJ126v2_KyZmVbG3laZlvWrX2kWGL13IyhVS2Ivjr_9uJAwMpBKuNByH0FBU3ziJcRdqkXiz6lnYMSRSQ1Y8Dmkjkrc0DNTABvuHjbZ7Fh0LOINswW_wrkVsP4PjDq1IVh7IY0hLE_W1G1DKlROc5BZEOjcw%3D%3D; BEPIN=%1%16a8c46770b%3Bbak92b.b.tripadvisor.com%3A10023%3B; TATravelInfo=V2*A.2*MG.-1*HP.2*FL.3*DSM.1557131589173*RS.1*RY.2019*RM.5*RD.6*RH.20*RG.2; CM=%1%RestAds%2FRPers%2C%2C-1%7CRCPers%2C%2C-1%7Csesstch15%2C%2C-1%7CCYLPUSess%2C%2C-1%7Ctvsess%2C%2C-1%7CPremiumMCSess%2C%2C-1%7CRestPartSess%2C%2C-1%7CUVOwnersSess%2C%2C-1%7CRestPremRSess%2C%2C-1%7CPremRetPers%2C%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7C%24%2C%2C-1%7Ct4b-sc%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csesshours%2C%2C-1%7CTARSWBPers%2C%2C-1%7CTheForkORSess%2C%2C-1%7CTheForkRRSess%2C%2C-1%7CRestAds%2FRSess%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7Csesslaf%2C%2C-1%7CRestPartPers%2C%2C-1%7CCYLPUPers%2C%2C-1%7CCCUVOwnSess%2C%2C-1%7Cperslaf%2C%2C-1%7CUVOwnersPers%2C%2C-1%7Csh%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7CCCPers%2C%2C-1%7Cb2bmcsess%2C%2C-1%7CSPMCPers%2C%2C-1%7Cperswifi%2C%2C-1%7CPremRetSess%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CRestAdsCCPers%2C%2C-1%7CTrayssess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CSPORPers%2C%2C-1%7Cperssticker%2C%2C-1%7Cbooksticks%2C%2C-1%7CSPMCWBSess%2C%2C-1%7Cbookstickp%2C%2C-1%7CPremiumMobSess%2C%2C-1%7Csesswifi%2C%2C-1%7Ct4b-pc%2C%2C-1%7CWShadeSeen%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C9%2C-1%7CPremiumSURPers%2C%2C-1%7CCCUVOwnPers%2C%2C-1%7CTBPers%2C%2C-1%7Cperstch15%2C%2C-1%7CCCSess%2C2%2C-1%7CCYLSess%2C%2C-1%7Cpershours%2C%2C-1%7CPremiumORSess%2C%2C-1%7CRestAdsPers%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CTrayspers%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CTADORSess%2C%2C-1%7CAdsRetPers%2C%2C-1%7CMCPPers%2C%2C-1%7CSPMCSess%2C%2C-1%7Cpers_rev%2C%2C-1%7Cmdpers%2C%2C-1%7Cmds%2C1557131565748%2C1557217965%7CSPMCWBPers%2C%2C-1%7CRBAPers%2C%2C-1%7CHomeAPers%2C%2C-1%7CRCSess%2C%2C-1%7CRestAdsCCSess%2C%2C-1%7CRestPremRPers%2C%2C-1%7Cpssamex%2C%2C-1%7CCYLPers%2C%2C-1%7Ctvpers%2C%2C-1%7CTBSess%2C%2C-1%7CAdsRetSess%2C%2C-1%7CMCPSess%2C%2C-1%7CTADORPers%2C%2C-1%7CTheForkORPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CTARSWBSess%2C%2C-1%7CRestAdsSess%2C%2C-1%7CRBASess%2C%2C-1%7Cmdsess%2C%2C-1%7C; fbsr_162729813767876=wtGNSIucBSm5EusyRkPyX_GfZwxNkyHLxTRli46iHoM.eyJjb2RlIjoiQVFBUHV3SlZpOVNXQXVkMDh1bUdaYjZ2R3hBMkdfdFBZdm9Bb2l2cDEzSDNvaG1ESjRkamo1V1A3dnB5WloxWmwzeWxFTmdCT0dCbTB6dzc1S2pwUHFKak5nQVNKMGNqOEtvUVY1YzZXNHhNQ1FlMURNNXJOUUpMeEJldjlBS2xKNnhVVjVXQ1ZaajZjN1k4X1ZWeGdxbzlIclhKT3BvUDZSLTVzNkVUZ3Q5Q0xMNmg0ZnZIY0pMSm1KdXJwN0lGVFBSOUdvX0Z4M0FiM0VWQ1RnVFNGNzc2NFFuU29fdER5VFk3TWY0V0VKSFZXZi11ME1pa2ZWS1ZzUHdHQlBOOE1xZkVQNjZfZHpZMVdnSEVfcWR4d2FHN2xNODNyR1BWaDVwdDdodlFQQmFBbGtzU21IYjZiSktEaGVGajM4WTg3TGxUUF9hNEVGUjVjOVdoOVNhY2RmV04iLCJ1c2VyX2lkIjoiMTY1NjQ2NDcxNSIsImFsZ29yaXRobSI6IkhNQUMtU0hBMjU2IiwiaXNzdWVkX2F0IjoxNTU3MTMzMTgxfQ; TAReturnTo=%1%%2FRestaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html; roybatty=TNI1625!APyGsDM6tcKypRo49myenvbO5Zyk367lJP3JEhTSBrfno%2F4Bbienyfvs6Q2DU%2F2UmkzjN1pKquiSNGeY2cXQm8s8oX1jKwXT8hgK3GL%2B6psZHdp4k7TF4F52uoI2kQ1e9Ni2k9Ub8D5ak%2FXgN%2F9as9m2HZIB0G6SZnZMT%2FPD73Fo%2C1; SRT=%1%enc%3A8yMCW7EtdBqPX0oluvfOS5mBk6DRMHXwNEAPJlcpaDumiCWsxs%2BxfBbTYsxpa%2F9l%2FJzCllshf9g%3D; TASession=V2ID.2C4059CFCBC27797DA97994A5CF94A28*SQ.233*LS.PageMoniker*GR.7*TCPAR.44*TBR.80*EXEX.60*ABTR.87*PHTB.57*FS.2*CPU.54*HS.recommended*ES.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.en*FA.1*DF.0*IR.4*TRA.false*LD.304551; TAUD=LA-1557055610999-1*RDD-1-2019_05_05*RD-75954750-2019_05_06.9784431*HDD-75978369-2019_05_19.2019_05_20.1*HC-76743574*LG-77588176-2.1.F.*LD-77588177-.....',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
response = requests.post("https://www.tripadvisor.com/RestaurantSearch?Action=PAGE&geo=304551&ajax=1&itags=10591&sortOrder=relevance&o=a30&availSearchEnabled=false", headers=header)