Getting the Denver Restaurants Opening Report

Denver Eater is a really cool website that gathers a lot of information about restaurants. The coolest part is their Opening Report: a collection of news articles about new restaurants that have opened in the area since 2011.

My idea is to use this information to visualize restaurant openings on a map. In this blog post I'm going to show how I got the information off this website and wrote it to a Google Sheet.

Load libraries

I like to use magrittr for its beautiful chaining operator "%>%". rvest is a really easy to use R package that's inspired by the Beautiful Soup library in Python. googlesheets provides a way to manage and access Google Sheets from R. And I used tidyr for some data cleaning and manipulation.


In [1]:
library(magrittr)
library(rvest)
library(googlesheets)
library(tidyr)


Loading required package: xml2

Attaching package: ‘tidyr’

The following object is masked from ‘package:magrittr’:

    extract

Find URLs for Opening Report

In the screenshot, each news report title links to the actual report page. If I can scrape all of the links from these titles, I can tell R to read all of the reports for me. So the first step is to extract the h3 in each .m-block-river block. I didn't try it, but I think .m-enter-box__body would work the same way.


In [2]:
# the opening report has about 240 openings from Oct 25, 2011 to today
# the url to the archive of all the records looks like http://denver.eater.com/opening-report/archives/20
report_archive <- "http://denver.eater.com/opening-report/archives/"
pages <- 1:20

# the first step is to scrape all of the articles that appear in this opening report collection
urls <- lapply(pages, function(i) {
  paste0(report_archive, i) %>%
    read_html() %>%
    html_nodes('.m-block-river h3 a') %>%
    html_attr('href')
})

urls <- unlist(urls)

This is what the first few items look like.


In [3]:
head(urls)


Out[3]:
  1. 'http://denver.eater.com/2015/9/14/9325789/inside-scoop-lohi-platte-street'
  2. 'http://denver.eater.com/2015/9/11/9312903/cafe-marmotte-french-restaurant-wash-park'
  3. 'http://denver.eater.com/2015/9/8/9277649/glacier-ice-cream-lowry'
  4. 'http://denver.eater.com/2015/9/1/9241013/turn-in-bbq-food-truck-denver-barbecue'
  5. 'http://denver.eater.com/2015/9/1/9240573/the-broadmoor-restaurant-1858-seven-falls'
  6. 'http://denver.eater.com/2015/8/28/9219181/recess-beer-garden-lohi-opening'

Getting information from the opening report

Now within each link, there are several pieces of information I'm interested in on the page:

  • Name of the restaurant: this can come either from the little map in each report or from the bold text in the article body.
  • Coordinates of the restaurant: this information is only provided in the little map.
  • Address of the restaurant: this shows up in the little map. Sometimes the article spells out the address too, but there's no special CSS selector to locate it reliably.
  • Time the news report was posted: I would prefer the date the restaurant was actually established, but something is better than nothing, assuming Denver Eater reacts quickly when a new restaurant opens.
  • Content of the news report: this can be useful. Later I want to at least do a topic analysis to learn more about these new restaurants.
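Before looping over all the articles, the selectors can be sanity-checked on a made-up HTML fragment that mimics the map widget's structure (the class names are the real ones from the site; the content below is invented for illustration):

```r
library(rvest)

# a toy fragment imitating the little map widget in each report
snippet <- '<div class="m-map-embed">
  <div class="m-map-embed__body"><h2>Inside Scoop</h2></div>
  <div class="m-map-embed__address">1535 Platte St, Denver, CO 80202</div>
</div>'

page <- read_html(snippet)

# pull the restaurant name out of the h2 inside the map body
page %>% html_nodes(".m-map-embed__body h2") %>% html_text()

# pull the address out of the address element
page %>% html_nodes(".m-map-embed__address") %>% html_text()
```

If a selector returns `character(0)` here, it would return nothing on the real page too, which is exactly the case the scraping function below has to guard against.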

I wrote the following code to scrape this information from each page and stack the results together.


In [4]:
# for each link in html, extract the detailed information of each restaurant
opening_report <- do.call(rbind, lapply(1:length(urls), function(i) {
  html_page <- read_html(urls[i])
  
  time <- html_page %>%
    html_nodes(".p-byline__time") %>%
    html_text()
  
  content <- html_page %>%
    html_nodes(".m-entry__body p") %>%
    html_text() %>%
    paste(., sep = "", collapse = " ")
  
  test <- html_page %>%
    html_nodes(".m-map-embed")
  
  if(length(test) == 0) {
    
    coordinate <- NA
    
    name <- html_page %>%
      html_nodes(".m-entry__body p b") %>%
      html_text() %>%
      { if (length(.) == 0) NA else paste(., collapse = " ") }
    
    address <- NA
    
  } else {
    
    coordinate <- html_page %>%
      html_nodes(".m-chorus-simplemap") %>%
      html_attr("data-simplemap-latlong") %>%
      unlist() %>%
      unname()
    
    name <- html_page %>%
      html_nodes(".m-map-embed__body h2") %>%
      html_text()
    
    address <- html_page %>%
      html_nodes(".m-map-embed__address") %>%
      html_text()
  }
  cbind(name = name, coordinate = coordinate, address = address, time = time, content = content, url = urls[i])
}))

The result doesn't look very good at first glance.


In [6]:
head(opening_report, n = 1)


Out[6]:
name | coordinate | address | time | content | url
Inside Scoop | [39.7570475, -105.0087600] | 1535 Platte St, Denver, CO 80202 | Sep 14, 2015, 1:17p | Littleton-born Inside Scoop opened its Denver outpost a little over a week ago at 1535 Platte Street. A family-owned and operated business, the shop crafts its creations in the shop and offers some 20 different flavors that rotate and change weekly. Waffle cones are also made in-house. The decor is simple but inviting with casual tables and sunny yellow wall accents and local art displayed throughout the space. Right now, guests can enjoy classics like vanilla and mint-chocolate chip but also fun flavors like horchata, lemon crunch, and German chocolate cake. | http://denver.eater.com/2015/9/14/9325789/inside-scoop-lohi-platte-street

There are many places with stray blanks before or after a paragraph. The coordinates are all formatted in decimal degrees but wrapped in weird, useless brackets. I used tidyr's separate function to split them into latitude and longitude to make mapping easier later, and I also converted the result into a data frame.


In [8]:
# clean up stray whitespace and brackets in the opening report results
opening_report_clean <- apply(opening_report, 2, function(x) {
  gsub("^\\s+|\\n|\\[|\\]|\\s+$", "", x)
}) %>%
  as.data.frame(., stringsAsFactors = FALSE) %>%
  separate(coordinate, into = c("latitude", "longitude"), sep = ",")
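To see what that regular expression actually does, here it is applied to a made-up raw coordinate string in the same shape as the scraped values:

```r
# a raw coordinate value as scraped: stray whitespace, a newline, and brackets
raw <- "\n  [39.7570475, -105.0087600]  "

# strip leading/trailing whitespace, newlines, and square brackets
gsub("^\\s+|\\n|\\[|\\]|\\s+$", "", raw)
# leaves "39.7570475, -105.0087600", ready for tidyr's separate()
```

The alternation handles all four kinds of junk in a single pass, so one `gsub` per column is enough.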

Now my data frame with the 260+ opening reports looks nice and clean, except for a few missing data points.


In [9]:
head(opening_report_clean, n = 1)


Out[9]:
name | latitude | longitude | address | time | content | url
1 | Inside Scoop | 39.7570475 | -105.0087600 | 1535 Platte St, Denver, CO 80202 | Sep 14, 2015, 1:17p | Littleton-born Inside Scoop opened its Denver outpost a little over a week ago at 1535 Platte Street. A family-owned and operated business, the shop crafts its creations in the shop and offers some 20 different flavors that rotate and change weekly. Waffle cones are also made in-house. The decor is simple but inviting with casual tables and sunny yellow wall accents and local art displayed throughout the space. Right now, guests can enjoy classics like vanilla and mint-chocolate chip but also fun flavors like horchata, lemon crunch, and German chocolate cake. | http://denver.eater.com/2015/9/14/9325789/inside-scoop-lohi-platte-street
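One detail that will matter for mapping later, though I don't do it here: after separate(), the latitude and longitude columns are still character strings. A minimal sketch of the conversion, using values copied from the row above:

```r
# sample values as they come out of separate(); note the leading space
# that the comma split leaves in front of the longitude
df <- data.frame(latitude  = "39.7570475",
                 longitude = " -105.0087600",
                 stringsAsFactors = FALSE)

# as.numeric() tolerates the leading whitespace, so no extra trimming needed
df$latitude  <- as.numeric(df$latitude)
df$longitude <- as.numeric(df$longitude)
```

Most mapping packages expect numeric coordinate columns, so this step saves a confusing error down the road.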

Write the opening report to a Google Sheet

Now I want to write the opening report to a Google Sheet. I really like the googlesheets package; it makes everything very easy.

One thing I found out that might be useful for everybody: if you mess up the app authorization the first time, you can do the following to recover. (By mess up I mean you forget to copy and paste the authorization link that googlesheets prints in your R console into your browser and complete the steps there, like what happened to me many times, or something weird happens during the process that keeps you from finishing it.)

In your working directory, if you enable your operating system to show hidden files, you will see a .httr-oauth file. It is created by the httr package (which googlesheets uses) during authorization. Delete this file, then run gs_auth() again to redo the authorization. If you know how to redo the authorization the proper way, please let me know!
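The recovery step above, as a small sketch (gs_auth() is left commented out here because it launches an interactive browser flow):

```r
# if the cached OAuth token is broken, delete it so httr forgets it
if (file.exists(".httr-oauth")) file.remove(".httr-oauth")

# gs_auth()  # uncomment to restart the browser authorization from scratch
```

With the stale token gone, googlesheets will print a fresh authorization link the next time it needs one.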


In [12]:
gsheet_opening_report <- gs_new(title = "opening_report_blog")
gs_edit_cells(gsheet_opening_report, input = opening_report_clean)


Auto-refreshing stale OAuth token.
Sheet "opening_report_blog" created in Google Drive.
Worksheet dimensions: 1000 x 26.
Range affected by the update: "A1:G268"
Worksheet "Sheet1" successfully updated with 1876 new value(s).