There are three places to get onion price and quantity information by market.
Now, we can do this at two different levels of sophistication:
Automate the form-filling process: The form on this page looks simple. But viewing the source in the browser shows there is a form with hidden fields, and we will need to access the page as a browser does to get the session fields and then submit the form. This is a little more complicated than simply scraping a table from a webpage.
Manually fill the form: What if we manually fill the form with the desired fields and then save the page as an HTML file? Then we can read this file and just scrape the table from it. Let's go with the simple way for now.
So let us fill the form to get a small subset of data and test our scraping process. We will start by getting the Monthwise Market Arrivals.
The saved webpage is available at MonthWiseMarketArrivalsJan2016.html
We need to scrape data from this HTML page, so let us try to understand its structure.
You can view the source of the page - typically Right Click and View Source in any browser - and that will give you the source HTML for any page.
You can open the developer tools in your browser and investigate the structure as you mouse over the page.
We can use a tool like SelectorGadget to understand the ids and classes used in the web page.
Our data is under the <table> tag
Find the number of tables in the HTML structure of MonthWiseMarketArrivalsJan2016.html.
In [34]:
Out[34]:
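One way to count the tables is to read the saved file with rvest and select every `<table>` node. This is a minimal sketch, assuming MonthWiseMarketArrivalsJan2016.html is in the working directory:

```r
library(rvest)

# Read the saved page and count all <table> nodes in it
pg <- read_html("MonthWiseMarketArrivalsJan2016.html")
length(html_nodes(pg, "table"))
```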
Find the exact table and the #id attribute for the table.
In [ ]:
In [30]:
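To find which table holds the data, we can list the id attribute of every table on the page and pick out the GridView one. A sketch, assuming the same saved file as above:

```r
library(rvest)

# List the id attribute of each table; the data table
# should be the one whose id contains "GridView"
pg <- read_html("MonthWiseMarketArrivalsJan2016.html")
tables <- html_nodes(pg, "table")
html_attr(tables, "id")
```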
install.packages("rvest", repos='http://ftp.iitm.ac.in/cran/')
In [2]:
getwd()
Out[2]:
In [1]:
library(rvest)
In [10]:
pg.out <- read_html('MonthWiseMarketArrivalsJan2016.html')
In [28]:
pg.out
Out[28]:
In [16]:
# html_node() extracts the first element matching a CSS selector
pg.out %>% html_node("table")
In [17]:
# Read the page and convert to data frame
pg.table <- pg.out %>%
html_node("#dnn_ctr974_MonthWiseMarketArrivals_GridView1") %>%
html_table()
In [18]:
str(pg.table)
In [ ]:
We need to scrape data from a table, but we also need to submit a form to get that table. I will use a library called rvest to do this. rvest is inspired by Beautiful Soup in Python, which I like, so let's give it a go. Here is the link to rvest if you want to read more - http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
We will start by getting the Monthwise Market Arrivals. The form on this page looks simple, but viewing the source in the browser shows there is a form with hidden fields, and we will need to access the page as a browser does to get the session fields and then submit the form. First, let's get the form.
In [19]:
library(rvest)
In [20]:
url <- "http://nhrdf.org/en-us/MonthWiseMarketArrivals"
In [21]:
# Set a session - then get the form - extract the first one
pg.session <- html_session(url)
pg.form <- html_form(pg.session)[[1]]
Now that we have the form, let's see if we can fill it. Even though the form gives us options to choose by name, inspecting the HTML clearly shows that we need to supply a number for each of the fields. Leaving a field blank (for month, year, and market) matches all values. Let's get our data. (For testing, don't leave all of them blank.)
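Before filling anything in, it helps to print the form object, which lists the field names and their current values. A self-contained sketch that repeats the session setup from the cells above:

```r
library(rvest)

# Open a session as a browser would, grab the first form on the
# page, and print it to see the field names and default values
url <- "http://nhrdf.org/en-us/MonthWiseMarketArrivals"
pg.session <- html_session(url)
pg.form <- html_form(pg.session)[[1]]
pg.form
```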
In [22]:
# Set scraping value
# Crop = 1 for Onion, Year = numeric (blank for all years)
# MonthName = 1 for Jan and so on (blank for all months)
# Market = blank for all markets
crop <- 1
month <- 1
year <- 2016
market <- ""
In [23]:
# Fill the form with the values
pg.form <- html_form(pg.session)[[1]]
pg.form.filled <- set_values(pg.form,
"dnn$dnnLANG$selectCulture" = "en-US",
"dnn$ctr974$MonthWiseMarketArrivals$Market" = market,
"dnn$ctr974$MonthWiseMarketArrivals$MonthName" = month,
"dnn$ctr974$MonthWiseMarketArrivals$Year" = year,
"dnn$ctr974$MonthWiseMarketArrivals$Crop" = crop)
In [25]:
# Submit the form and get the page
pg.submit <- submit_form(pg.session, pg.form.filled,
submit = 'dnn$ctr974$MonthWiseMarketArrivals$btnSearch')
pg.out <- read_html(pg.submit)
Now that we have the html with our table, we need to find it on our page using the css selector. Then convert it into a data frame. And then write it to a csv file to store for the next step.
In [26]:
# Read the page and convert to data frame
pg.table <- pg.out %>%
html_node("#dnn_ctr974_MonthWiseMarketArrivals_GridView1") %>%
html_table()
In [27]:
str(pg.table)
In [ ]:
# Build the output file name from the scraped month and year
file <- paste("MonthWiseMarketArrivals-", month, "-", year, ".csv", sep = "")
In [ ]:
write.csv(pg.table, file = file, quote = FALSE, row.names = FALSE)
In [ ]: