Denver Eater is a really cool website that gathers many information about restaurants. Coolest out of the coolest is their Opening Report. These are collection of new articles of new restaurants opened in the area since 2011.
My idea is to use this information to visualize restaurants opening on a map. In this blog post I'm going to show I did I get information off from this website and write it to a google sheet.
In [1]:
library(magrittr)
library(rvest)
library(googlesheets)
library(tidyr)
In the screenshot, the title of the news report has a link to where the link to the actual new report can be found. If I can scrape all of the links from these titles, I can tell R to read all of these reports for me. So the first step is to extract the h3 in all of the .m-block-river. I didn't try this out but I think .m-enter-box__body will work the same way.
In [2]:
# the opening report has 240 opening information since Oct 25, 2011 to today
# the url to the archive of all the records is similar to http://denver.eater.com/opening-report/archives/20
report_archive <- "http://denver.eater.com/opening-report/archives/"
pages <- 1:20
# the first step is to scrap all of the articles that appears in this opening report collection
urls <- lapply(pages, function(i) {
paste0(report_archive, pages[i]) %>%
read_html() %>%
html_nodes('.m-block-river h3 a') %>%
html_attr('href')
})
urls <- unlist(urls)
This is how the first 5 items look like.
In [3]:
head(urls)
Out[3]:
Now within each link, there are 6 types of information I'm interested in this page:
I wrote the following function to scrape these information and put them in a list.
In [4]:
# for each link in html, extract the detailed information of each restaurant
opening_report <- do.call(rbind, lapply(1:length(urls), function(i) {
html_page <- read_html(urls[i])
time <- html_page %>%
html_nodes(".p-byline__time") %>%
html_text()
content <- html_page %>%
html_nodes(".m-entry__body p") %>%
html_text() %>%
paste(., sep = "", collapse = " ")
test <- html_page %>%
html_nodes(".m-map-embed")
if(length(test) == 0) {
coordinate <- NA
name <- html_page %>%
html_nodes(".m-entry__body p b") %>%
html_text() %>%
paste(., sep = "", collapse = " ") %>%
ifelse(length(.) == 0, NA, .)
address <- NA
} else {
coordinate <- html_page %>%
html_nodes(".m-chorus-simplemap") %>%
html_attr("data-simplemap-latlong") %>%
unlist() %>%
unname()
name <- html_page %>%
html_nodes(".m-map-embed__body h2") %>%
html_text()
address <- html_page %>%
html_nodes(".m-map-embed__address") %>%
html_text()
}
cbind(name = name, coordinate = coordinate, address = address, time = time, content = content, url = urls[i])
}))
The list doesn't look very good at a first glance.
In [6]:
head(opening_report, n = 1)
Out[6]:
There are many places with strange blanks before/after a paragraph. Coordination all formated in decimal degrees but with a wired / useless bracket. I used separate function to split them into latitude and longitude to make it easier for mapping later. I also converted into a data frame.
In [8]:
# write this opening report results to a google sheet
opening_report_clean <- apply(opening_report, 2, function(x) {
gsub("^\\s+|\\n|\\[|\\]|\\s+$", "", x)
}) %>%
as.data.frame(., stringsAsFactors = FALSE) %>%
separate(coordinate, into = c("latitude", "longitude"), sep = ",")
Now my data frame with the 260+ opening report looks nice and clean except for a few missing data point
In [9]:
head(opening_report_clean, n = 1)
Out[9]:
Now I want to write the opening report to a google sheet. I really liked the googlesheets package. It makes everything very easy.
One thing I found out might be useful for everybody is that if you messed up the app authorization the first time, you can do the following thing to make it up. (By mess up I mean you forget to copy paste the authorization link that googlesheets prints out in your R console to your browers and do what you needed to do - like what happened to me many times. Or anything wired happened during this process makes you cannot finish it).
In your working directory, if you enable your operating system to show hidden items, you will see this .httr-oauth file. This is created by httr package (which googlesheets uses) when passing the authorization. If you delete this file, then run gs_auth() again to do this the second time. If you know how to re-do the authorization normally, please let me know!
In [12]:
gsheet_opening_report <- gs_new(title = "opening_report_blog")
gs_edit_cells(gsheet_opening_report, input = opening_report_clean)