2. Acquire the Data

Finding Data Sources

There are three places to get onion price and quantity information by market.

  1. Agmarknet - This is the website run by the Directorate of Marketing & Inspection (DMI), Ministry of Agriculture, Government of India, and it provides daily price and arrival data for all agricultural commodities at the national and state level. Unfortunately, the link to get the Market-wise Daily Report for a Specific Commodity (Onion for us) leads to a multi-page aspx entry form that must be filled for each date. So it is likely to require an involved scraper to get the data. Too much effort - move on. Here is the best link to see what is available - http://agmarknet.nic.in/agnew/NationalBEnglish/SpecificCommodityWeeklyReport.aspx?ss=1
  2. Data.gov.in - This is normally a good place to get government data in a machine-readable form like csv or xml. The Variety-wise Daily Market Prices Data of Onion is available for each year as an XML, but unfortunately it does not include the quantity information that we need. It would be good to have both price and quantity - so even though this source is easy, let's see if we can get both from a different one. Here is the best link to see what is available - https://data.gov.in/catalog/variety-wise-daily-market-prices-data-onion#web_catalog_tabs_block_10
  3. NHRDF - This is the website of the National Horticultural Research & Development Foundation, which maintains a database of Market Arrivals and Price, Area and Production, and Export Data for three commodities - Garlic, Onion and Potatoes. We are in luck! It has data from 1996 onwards and only one form to fill to get the data in tabular form. Further, it also has production and export data. Excellent - let's use this. Here is the best link to see all that is available - http://nhrdf.org/en-us/DatabaseReports

Scraping the Data

Ways to Scrape Data

Now we can do this at two different levels of sophistication.

  1. Automate the form filling process: The form on this page looks simple. But viewing the source in the browser shows a form with hidden fields, and we would need to access the page as a browser does - fetch the session fields and then submit the form. This is a little more complicated than simply scraping a table on a webpage.

  2. Manually fill the form: What if we manually fill the form with the desired fields and then save the page as an html file? Then we can read this file and just scrape the table from it. Let's go with this simple way for now.
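For reference, here is roughly what option 1 would have involved. This is a minimal sketch using only the standard library; the sample HTML below is an illustrative stand-in, since a real ASP.NET page carries hidden fields such as __VIEWSTATE whose values change per session and must be read from the live page before POSTing the form back.

```python
# Sketch of option 1 (not used in this notebook): GET the page, collect the
# hidden ASP.NET state fields, then POST them back along with our selections.
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> fields."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

# Tiny stand-in for the real page source
sample = '<form><input type="hidden" name="__VIEWSTATE" value="abc123"/></form>'
parser = HiddenFieldParser()
parser.feed(sample)
print(parser.fields)  # {'__VIEWSTATE': 'abc123'}
```

With the hidden fields in hand, the remaining step would be to submit them plus the crop, month, market, and year selections in a POST request - the part we are skipping by filling the form manually.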

Scraping - Manual Form Filling

So let us fill the form to get a small subset of data and test our scraping process. We will start by getting the Monthwise Market Arrivals.

  • Crop Name: Onion
  • Month: January
  • Market: All
  • Year: 2016

The saved webpage is available at MonthWiseMarketArrivalsJan2016.html

Understand the HTML Structure

We need to scrape data from this html page, so let us try to understand the structure of the page.

  1. You can view the source of the page - typically Right Click and View Source in any browser - and that will give you the source HTML for any page.

  2. You can open the developer tools in your browser and investigate the structure as you mouse over the page

  3. We can use a tool like SelectorGadget to understand the ids and classes used in the web page

Our data is under the <table> tag

Exercise #1

Find the number of tables in the HTML structure of MonthWiseMarketArrivalsJan2016.html.


In [ ]:

Find all the Tables


In [ ]:
# Import the library we need, which is Pandas
import pandas as pd

In [ ]:
# Read all the tables from the html document 
AllTables = pd.read_html('MonthWiseMarketArrivalsJan2016.html')

In [ ]:
# Let us find out how many tables it has found
len(AllTables)
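As a cross-check on Exercise #1, we could also count the <table> tags directly with the standard library, without pandas. A sketch, demonstrated on an inline snippet; for the real count, feed it the contents of MonthWiseMarketArrivalsJan2016.html instead.

```python
# Count <table> tags in an HTML document using only the standard library.
from html.parser import HTMLParser

class TableCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1

# Demonstrated on a small snippet; for the real file use
# open('MonthWiseMarketArrivalsJan2016.html').read()
snippet = "<html><body><table></table><div><table></table></div></body></html>"
counter = TableCounter()
counter.feed(snippet)
print(counter.count)  # 2 tables in this snippet
```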

Exercise #2

Find the exact table of data we want in the list AllTables.


In [ ]:

Get the exact table

To read the exact table we need to pass in an identifier value that picks out the table we want. We can use the attrs parameter in read_html to do so. The attribute we will pass is the table's id.


In [ ]:
# So can we read our exact table
OneTable = pd.read_html('MonthWiseMarketArrivalsJan2016.html', 
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

In [ ]:
# So how many tables have we got now
len(OneTable)

In [ ]:
# Show the table of data identified by pandas - just the first five rows
OneTable[0].head()

However, we have not got the header correctly in our dataframe. Let us see if we can fix this.

To get help on any function, just put ?? before the function name. Run this and see what additional parameter you need to define to get the header correctly.


In [ ]:
??pd.read_html

Exercise #3

Read the html file again and ensure that the correct header is identified by pandas.


In [ ]:
OneTable = pd.read_html('MonthWiseMarketArrivalsJan2016.html', header = 0,
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

Show the top five rows of the dataframe you have read to ensure the headers are now correct.


In [ ]:
OneTable[0].head()

Dataframe Viewing


In [ ]:
# Let us store the dataframe in a variable named df. You will see this as a very common convention in data science code that uses pandas
df = OneTable[0]

In [ ]:
# Shape of the dataset - number of rows & number of columns in the dataframe
df.shape

In [ ]:
# Get the names of all the columns 
df.columns

In [ ]:
# Can we see sample rows - the top 5 rows
df.head()

In [ ]:
# Can we see sample rows - the bottom 5 rows
df.tail()

In [ ]:
# Can we access a specific column
df["Market"]

In [ ]:
# Using the dot notation
df.Market

In [ ]:
# Selecting specific rows of a column
df[0:5]["Market"]

In [ ]:
# Works both ways
df["Market"][0:5]

In [ ]:
# Get the unique values of Market
pd.unique(df['Market'])
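Two related pandas methods are often useful alongside unique(): nunique() for the number of distinct values and value_counts() for the number of rows per value. A quick sketch on a toy dataframe (the market names here are just illustrative stand-ins for our scraped data):

```python
import pandas as pd

# Toy frame standing in for our scraped data
toy = pd.DataFrame({'market': ['AGRA', 'AGRA', 'AHMEDABAD', 'BANGALORE']})

print(toy['market'].nunique())       # number of distinct markets: 3
print(toy['market'].value_counts())  # rows per market
```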

Downloading the Entire Month Wise Arrival Data


In [ ]:
AllTable = pd.read_html('MonthWiseMarketArrivals.html', header = 0,
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

In [ ]:
AllTable[0].head()

In [ ]:
??pd.DataFrame.to_csv

In [ ]:
AllTable[0].columns

In [ ]:
# Change the column names to simpler ones
AllTable[0].columns = ['market', 'month', 'year', 'quantity', 'priceMin', 'priceMax', 'priceMod']

In [ ]:
AllTable[0].head()

In [ ]:
# Save the dataframe to a csv file
AllTable[0].to_csv('MonthWiseMarketArrivals.csv', index = False)
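After any save it is worth confirming that the data survives the round trip - read the CSV back and compare shape and columns. Sketched here on a toy dataframe so it runs standalone; with the real data, read back 'MonthWiseMarketArrivals.csv' and compare against AllTable[0].

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for AllTable[0]; the round-trip logic is identical.
toy = pd.DataFrame({'market': ['AGRA', 'LASALGAON'],
                    'quantity': [2400, 18500]})

path = os.path.join(tempfile.mkdtemp(), 'check.csv')
toy.to_csv(path, index=False)   # index=False keeps the row index out of the file
back = pd.read_csv(path)

print(back.shape)               # should match toy.shape
print(list(back.columns))       # should match the original column names
```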