(To load the stylesheet of this notebook, execute the last code cell.)

In this homework we will practice two techniques: web scraping and regression. In the first part, we will build on the sample code from the lecture and collect some basic information about each hotel. Then, we will fit regression models to this information and analyze the results.

One of the main disadvantages of scraping a website instead of using an API is that the website may change its layout without notice and render our code useless. That is what happened in our case: Tripadvisor changed the layout of the buttons that we use to navigate between the result pages. This was the main reason people had problems executing the code.

The first task of the homework is to fix the scraping code. The part that checks whether there is another page of results and extracts its link must be replaced with new code that reflects the new navigation layout.
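As a starting point, the "is there another page, and what is its link" check could be sketched as below. This is only an illustration using the standard library: the class name `next`, the sample markup, and the `example.com` URL are hypothetical and do not describe Tripadvisor's actual layout, so inspect the live page to find the right marker.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextPageFinder(HTMLParser):
    """Remember the href of the first <a> tag whose class list contains `marker`."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.marker in classes and attrs.get("href"):
            self.href = attrs["href"]


def next_page_url(html, base_url, marker="next"):
    """Return the absolute URL of the next results page, or None if there is none."""
    finder = NextPageFinder(marker)
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None


# Hypothetical markup, used only to exercise the function.
sample = (
    '<div><a class="nav previous" href="/Hotels-p1">Prev</a>'
    '<a class="nav next" href="/Hotels-p3">Next</a></div>'
)
print(next_page_url(sample, "https://www.example.com/Hotels-p2"))
```

The scraping loop can then keep requesting pages until `next_page_url` returns `None`.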

Then, for each hotel that our search returns, we will "click" (with the code of course) on it and scrape the information below.

Of course, feel free to collect even more data if you want.

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$\text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$
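The same computation can be written as a small helper. This is a minimal sketch: the function name `average_score` and the dictionary layout (stars mapped to number of ratings) are assumptions made for illustration, not part of the lecture code.

```python
def average_score(counts):
    """Weighted mean of star ratings; counts[k] is the number of k-star ratings."""
    total = sum(counts.values())
    if total == 0:
        return None  # no ratings, so the average is undefined
    return sum(stars * n for stars, n in counts.items()) / total


# Rating counts taken from the formula above.
example = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
print(round(average_score(example), 3))  # → 4.635
```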

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
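One possible way to fit such a model is ordinary least squares with NumPy, sketched below. The helper name `fit_linear` is an assumption, and the numbers are made-up toy values chosen so that the true coefficients are known; they are not scraped data.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


# Made-up toy values (NOT scraped data): y = 1 + 2*x1 + 0.5*x2 exactly.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y = [1 + 2 * a + 0.5 * b for a, b in X]

intercept, coefs = fit_linear(X, y)
print(intercept, coefs)
```

When the features are on very different scales, standardizing them before fitting usually makes the coefficient magnitudes easier to compare.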

Finally, we will use logistic regression to decide if a hotel is excellent or not. We classify a hotel as excellent if more than 60% of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
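The binary target described above can be derived from the rating counts as follows. This is a sketch: the name `is_excellent` and the choice to label a hotel with no ratings as not excellent are assumptions.

```python
def is_excellent(counts, threshold=0.6):
    """Return 1 if more than `threshold` of all ratings are 5 stars, else 0."""
    total = sum(counts.values())
    if total == 0:
        return 0  # assumption: a hotel without ratings is not labeled excellent
    return int(counts.get(5, 0) / total > threshold)


# Rating counts of the example hotel: 1861 of 2527 ratings are 5 stars.
example = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
print(is_excellent(example))  # → 1
```

The resulting 0/1 column is the response on which the logistic regression model is fit.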

To use code from a Python script file, we need to put that file in the same folder as the notebook and import it as a module. Then, we will be able to access its functions. For example, in the case of the lecture code, we could do the following:

import scrape_solution as scrape

scrape.get_city_page()


Of course, you might need to modify and restructure the code so that it returns what you need.



In [2]:

# Code for setting the style of the notebook
from IPython.display import HTML

def css_styling():
    # `styles` must hold the stylesheet text; its definition is not shown in this cell
    return HTML(styles)

css_styling()




Out[2]:

@font-face {
font-family: "Computer Modern";
src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');
}
.code_cell {
width: 105ex !important ;
margin-bottom: 15px !important;
}
div.cell {
margin-left: auto;
margin-right: auto;
width: 70%;
}
div.cell.selected {
border: thin rgba(171, 171, 171, 0.5) dashed;
}
h1 {
font-family: 'Alegreya Sans', sans-serif;
}
h2 {
font-family: 'EB Garamond', serif;
}
h3 {
font-family: 'EB Garamond', serif;
margin-top:12px;
margin-bottom: 3px;
}
h4 {
font-family: 'EB Garamond', serif;
}
h5 {
font-family: 'Alegreya Sans', sans-serif;
}
div.text_cell_render {
font-family: 'EB Garamond',Computer Modern, "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;
line-height: 145%;
font-size: 140%;
}
div.input_area {
border-color: rgba(0,0,0,0.10) !important;
background: #fafafa;
}
.CodeMirror {
font-family: "Source Code Pro";
font-size: 90%;
}
.prompt {
display: None;
}
.output {
}
.output_wrapper {
}
div.output_scroll {
width: inherit;
}
.inner_cell {
}
.text_cell_render h1 {
font-weight: 200;
font-size: 50pt;
line-height: 100%;
color:#CD2305;
margin-bottom: 0.5em;
margin-top: 0.5em;
display: block;
}
.text_cell_render h5 {
font-weight: 300;
font-size: 16pt;
color: #CD2305;
font-style: italic;
margin-bottom: .5em;
margin-top: 0.5em;
display: block;
}
.warning {
color: rgb( 240, 20, 20 )
}

MathJax.Hub.Config({
TeX: {
extensions: ["AMSmath.js"]
},
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ]
},
displayAlign: 'center', // Change this to 'center' to center equations.
"HTML-CSS": {
styles: {'.MathJax_Display': {"margin": 4}}
}
});