2. Acquire the Data

Finding Data Sources

There are three places to get onion price and quantity information by market.

  1. Agmarknet - This is the website run by the Directorate of Marketing & Inspection (DMI), Ministry of Agriculture, Government of India, and it provides daily price and arrival data for all agricultural commodities at the national and state level. Unfortunately, the link to get the Market-wise Daily Report for a Specific Commodity (Onion for us) leads to a multipage aspx entry form that must be submitted for each date. So it is likely to require an involved scraper to get the data. Too much effort - move on. Here is the best link to see what is available - http://agmarknet.nic.in/agnew/NationalBEnglish/SpecificCommodityWeeklyReport.aspx?ss=1
  2. Data.gov.in - This is normally a good place to get government data in a machine-readable form like CSV or XML. The Variety-wise Daily Market Prices Data of Onion is available for each year as an XML file, but unfortunately it does not include the quantity information that we need. It would be good to have both price and quantity - so even though this source is easy, let's see if we can get both from somewhere else. Here is the best link to see what is available - https://data.gov.in/catalog/variety-wise-daily-market-prices-data-onion#web_catalog_tabs_block_10
  3. NHRDF - This is the website of the National Horticultural Research & Development Foundation, which maintains a database of Market Arrivals and Price, Area and Production, and Export Data for three commodities - Garlic, Onion and Potatoes. We are in luck! It has data from 1996 onwards, and there is only one form to fill to get the data in tabular form. Further, it also has production and export data. Excellent - let's use this. Here is the best link to see all that is available - http://nhrdf.org/en-us/DatabaseReports

Scraping the Data

Ways to Scrape Data

Now we can do this at two different levels of sophistication:

  1. Automate the form filling process: The form on this page looks simple. But viewing the source in the browser shows there is a form with hidden fields, and we will need to access the page as a browser would to get the session fields and then submit the form. This is a little more complicated than simply scraping a table on a webpage.

  2. Manually fill the form: What if we manually fill the form with the desired fields and then save the page as an HTML file? Then we can read this file and just scrape the table from it. Let's go with the simple way for now.

Scraping - Manual Form Filling

So let us fill the form to get a small subset of data and test our scraping process. We will start by getting the Monthwise Market Arrivals.

  • Crop Name: Onion
  • Month: January
  • Market: All
  • Year: 2016

The saved webpage is available at MonthWiseMarketArrivalsJan2016.html

Understand the HTML Structure

We need to scrape data from this HTML page, so let us try to understand its structure.

  1. You can view the source of the page - typically Right Click and View Source in any browser - and that will give you the source HTML for the page.

  2. You can open the developer tools in your browser and inspect the structure as you mouse over the page.

  3. We can use a tool like SelectorGadget to understand the ids and classes used in the web page.

Our data is under the <table> tag
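As a quick way to answer such questions, rvest itself can count the tables and list their id attributes. Here is a minimal sketch on a toy HTML string (the toy ids are made up; for the real page you would pass the saved file name to read_html instead):

```r
library(rvest)  # loads xml2 as well

# A toy page with two tables, one carrying an id - a stand-in for the saved file
toy <- read_html('<html><body>
  <table id="layout"><tr><td>x</td></tr></table>
  <table id="data"><tr><td>1</td></tr></table>
</body></html>')

tables <- html_nodes(toy, "table")
length(tables)           # how many <table> tags the page contains
html_attr(tables, "id")  # their id attributes, to pick the right one
```

For the saved page, toy would be read_html('MonthWiseMarketArrivalsJan2016.html').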

Exercise - Finding the Table

Find the number of tables in the HTML structure of MonthWiseMarketArrivalsJan2016.html.


In [34]:



Out[34]:
''

Find the exact table and the id attribute for the table.


In [ ]:

Manual Scraping


In [30]:
install.packages("rvest", repos='http://ftp.iitm.ac.in/cran/')


also installing the dependencies 'curl', 'openssl', 'XML', 'xml2', 'httr', 'selectr'

Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'curl' had non-zero exit status
Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'openssl' had non-zero exit status
Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'XML' had non-zero exit status
Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'xml2' had non-zero exit status
Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'httr' had non-zero exit status
Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'selectr' had non-zero exit status
Warning message:
In install.packages("rvest", repos = "http://ftp.iitm.ac.in/cran/"): installation of package 'rvest' had non-zero exit status
The downloaded source packages are in
	'/private/var/folders/04/r20f0_4n2m7cv23lr8t97wp00000gn/T/Rtmpi3LZwX/downloaded_packages'
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

In [2]:
getwd()


Out[2]:
'/Users/amitkaps/Dropbox/github/intro-R-data-science/onion'

In [1]:
library(rvest)


Loading required package: xml2

In [10]:
pg.out <- read_html('MonthWiseMarketArrivalsJan2016.html')

In [28]:
pg.out


Out[28]:
{xml_document}
<html>
[1] <head id="Head">&#13;\n<!--********************************************** ...
[2] <body id="Body">&#13;\n    &#13;\n    <form method="post" action="/en-us/ ...

In [16]:
html_node()


Error: Please supply one of css or xpath

In [17]:
# Read the page and convert to data frame
pg.table <-  pg.out %>% 
            html_node("#dnn_ctr974_MonthWiseMarketArrivals_GridView1")  %>%
            html_table()

In [18]:
str(pg.table)


'data.frame':	10228 obs. of  7 variables:
 $ Market              : chr  "ABOHAR(PB)" "ABOHAR(PB)" "ABOHAR(PB)" "ABOHAR(PB)" ...
 $ Month Name          : chr  "January" "January" "January" "January" ...
 $ Year                : chr  "2005" "2006" "2010" "2011" ...
 $ Arrival (q)         : int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ Price Minimum (Rs/q): chr  "404" "487" "1283" "3067" ...
 $ Price Maximum (Rs/q): chr  "493" "638" "1592" "3750" ...
 $ Modal Price (Rs/q)  : chr  "446" "563" "1460" "3433" ...

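Note that the price columns come back as chr rather than int, presumably because some entries contain commas or non-numeric text (e.g. summary rows). A minimal cleaning sketch - the helper name to_num is ours, not from the original notebook:

```r
# Strip thousands separators and coerce to numeric;
# non-numeric entries become NA (with a warning)
to_num <- function(x) as.numeric(gsub(",", "", x))

to_num(c("404", "1,283", "Total"))  # -> 404 1283 NA
```

This could then be applied to each price column, e.g. pg.table$`Modal Price (Rs/q)` <- to_num(pg.table$`Modal Price (Rs/q)`).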
In [ ]:

Automatic Method

We need to scrape data from a table, but we also need to submit a form to get that table. I will use a library called rvest to do this. rvest is inspired by Beautiful Soup in Python, which I like, so let's give it a go. Here is the link to rvest if you want to read more - http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

We will start by getting the Monthwise Market Arrivals. The form on this page looks simple. But viewing the source in the browser shows there is a form with hidden fields, and we will need to access the page as a browser would to get the session fields and then submit the form. First, let's get the form.


In [19]:
library(rvest)

In [20]:
url <- "http://nhrdf.org/en-us/MonthWiseMarketArrivals"

In [21]:
# Set a session - then get the form - extract the first one
pg.session <- html_session(url)
pg.form <- html_form(pg.session)[[1]]

Now that we have the form, let's see if we can fill it. Even though the form gives us options to choose by name, inspecting the HTML shows clearly that we need to supply a number for each of the fields. Leaving a field blank (for month, year and market) makes it equal to "all". Let's get our data. (For testing, don't leave them all blank.)


In [22]:
# Set scraping value 
# Crop = 1 for Onion, Year = numeric (blank for all years)
# MonthName = 1 for Jan and so on (blank for all months)
# Market = blank for all markets
crop <- 1
month <- 1
year <- 2016
market <- ""

In [23]:
# Fill the form with the values
pg.form <- html_form(pg.session)[[1]]
pg.form.filled <- set_values(pg.form, 
                      "dnn$dnnLANG$selectCulture" = "en-US",
                      "dnn$ctr974$MonthWiseMarketArrivals$Market" = market,
                      "dnn$ctr974$MonthWiseMarketArrivals$MonthName" = month,
                      "dnn$ctr974$MonthWiseMarketArrivals$Year" = year,
                      "dnn$ctr974$MonthWiseMarketArrivals$Crop" = crop)

In [25]:
# Submit the form and get the page
pg.submit <- submit_form(pg.session, pg.form.filled,
                         submit = 'dnn$ctr974$MonthWiseMarketArrivals$btnSearch')
pg.out <- read_html(pg.submit)

Now that we have the HTML with our table, we need to find the table on the page using its CSS selector, convert it into a data frame, and then write it to a CSV file to store for the next step.


In [26]:
# Read the page and convert to data frame
pg.table <-  pg.out %>% 
            html_node("#dnn_ctr974_MonthWiseMarketArrivals_GridView1")  %>%
            html_table()

In [27]:
str(pg.table)


'data.frame':	84 obs. of  7 variables:
 $ Market              : chr  "AGRA(UP)" "AHMEDABAD(GUJ)" "AHMEDNAGAR(MS)" "AJMER(RAJ)" ...
 $ Month Name          : chr  "January" "January" "January" "January" ...
 $ Year                : chr  "2016" "2016" "2016" "2016" ...
 $ Arrival (q)         : int  134200 198390 208751 4247 12350 9788 24800 600 507223 18435 ...
 $ Price Minimum (Rs/q): chr  "1039" "646" "175" "722" ...
 $ Price Maximum (Rs/q): chr  "1443" "1224" "1722" "1067" ...
 $ Modal Price (Rs/q)  : chr  "1349" "997" "1138" "939" ...

In [ ]:
# Build the output file name from the month and year, e.g. MonthWiseMarketArrivals_1_2016.csv
file <- paste("MonthWiseMarketArrivals_", month, "_", year, ".csv", sep = "")

In [ ]:
write.csv(pg.table, file = file, quote = FALSE, row.names = FALSE)

Exercise - Downloading the Entire Month Wise Arrival Data


In [ ]:
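One way to approach this exercise is to loop over the years, leaving month and market blank so each submission returns all months and markets for that year, and writing one CSV per year. A hedged sketch reusing the session, field names and selector from above - check them against the live form before running, and note that it needs network access:

```r
library(rvest)

url <- "http://nhrdf.org/en-us/MonthWiseMarketArrivals"
pg.session <- html_session(url)

for (yr in 1996:2016) {
  # Refill and resubmit the form for each year; blank = all months/markets
  pg.form <- html_form(pg.session)[[1]]
  pg.form.filled <- set_values(pg.form,
      "dnn$ctr974$MonthWiseMarketArrivals$Crop" = 1,   # 1 = Onion
      "dnn$ctr974$MonthWiseMarketArrivals$Year" = yr,
      "dnn$ctr974$MonthWiseMarketArrivals$MonthName" = "",
      "dnn$ctr974$MonthWiseMarketArrivals$Market" = "")
  pg.submit <- submit_form(pg.session, pg.form.filled,
      submit = 'dnn$ctr974$MonthWiseMarketArrivals$btnSearch')

  pg.table <- read_html(pg.submit) %>%
      html_node("#dnn_ctr974_MonthWiseMarketArrivals_GridView1") %>%
      html_table()
  write.csv(pg.table,
            file = paste("MonthWiseMarketArrivals_", yr, ".csv", sep = ""),
            quote = FALSE, row.names = FALSE)
}
```

The yearly CSV files can afterwards be combined into one data frame with rbind for the analysis step.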