There are three places to get onion price and quantity information by market.
Now we can do this at two different levels of sophistication:
Automate the form filling process: The form on this page looks simple. But viewing the source in the browser shows a form with hidden fields, and we will need to act as a browser to get the session fields before submitting the form. This is a little more complicated than simply scraping a table on a webpage.
Manually fill the form: What if we manually fill the form with the desired form fields and then save the page as an HTML file? Then we can read this file and just scrape the table from it. Let's go with the simple way for now.
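For reference, the key step in the automated approach is pulling out those hidden session fields so they can be POSTed back along with the visible form values. Here is a minimal sketch using only the standard library; the field names (`__VIEWSTATE`, `__EVENTVALIDATION`) and the toy form are hypothetical stand-ins for what the real page would contain.

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of all hidden <input> fields."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('type') == 'hidden':
            self.fields[a.get('name')] = a.get('value', '')

# A toy form resembling the hidden session fields such pages carry
html = '''
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="Market" />
</form>
'''

parser = HiddenFieldParser()
parser.feed(html)
print(parser.fields)
```

The dictionary of hidden fields would then be merged with our chosen form values and submitted back to the server, e.g. with `requests.Session().post(url, data=payload)`.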
So let us fill the form to get a small subset of data and test our scraping process. We will start by getting the Monthwise Market Arrivals.
The saved webpage is available at MonthWiseMarketArrivalsJan2016.html
We need to scrape data from this HTML page, so let us try to understand the structure of the page.
You can view the source of the page - typically Right Click and View Source in any browser - and that would give you the source HTML for any page.
You can open the developer tools in your browser and inspect the structure as you mouse over the page.
We can use a tool like SelectorGadget to understand the ids and classes used in the web page.
Our data is under the <table> tag
How many tables are there in the HTML structure of MonthWiseMarketArrivalsJan2016.html?
In [1]:
# Import the library we need, which is Pandas
import pandas as pd
In [2]:
# Read all the tables from the html document
AllTables = pd.read_html('MonthWiseMarketArrivalsJan2016.html')
In [3]:
# Let us find out how many tables it has found
len(AllTables)
Out[3]:
In [4]:
# What data structure has read_html returned?
type(AllTables)
Out[4]:
In [5]:
# Look at one of the tables found, say the fifth one
AllTables[4]
Out[5]:
In [6]:
# So can we read just our exact table, selecting it by its id attribute?
OneTable = pd.read_html('MonthWiseMarketArrivalsJan2016.html',
attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})
In [7]:
# So how many tables have we got now?
len(OneTable)
Out[7]:
In [8]:
# Show the table of data identified by pandas, with just the first five rows
OneTable[0].head()
Out[8]:
However, the header has not been read correctly into our dataframe. Let us see if we can fix this.
To get help on any function, just put ?? before the function name. Run this and see what additional parameter you need to define to get the header read correctly.
In [ ]:
??pd.read_html
In [11]:
OneTable = pd.read_html('MonthWiseMarketArrivalsJan2016.html', header = 0,
attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})
Show the top five rows of the dataframe you have read to ensure the headers are now correct.
In [10]:
OneTable[0].head()
Out[10]:
In [12]:
# Let us store the dataframe in a df variable. You will see this as a very common convention in data science code that uses pandas
df = OneTable[0]
In [13]:
# Shape of the dataset - number of rows & number of columns in the dataframe
df.shape
Out[13]:
In [14]:
# Get the names of all the columns
df.columns
Out[14]:
In [15]:
# Can we see sample rows - the top 5 rows
df.head()
Out[15]:
In [16]:
# Can we see sample rows - the bottom 5 rows
df.tail()
Out[16]:
In [17]:
# Can we access a specific column?
df["Market"]
Out[17]:
In [18]:
# Using the dot notation
df.Market
Out[18]:
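One caveat worth knowing: dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation always works. A small sketch with a made-up frame:

```python
import pandas as pd

# A toy frame: one column name is a valid identifier, one is not
df = pd.DataFrame({'Market': ['LASALGAON'], 'price min': [500]})

print(df.Market[0])        # dot notation works for 'Market'
print(df['price min'][0])  # bracket notation needed: the name has a space
```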
In [19]:
# Selecting specific rows of a specific column
df[0:5]["Market"]
Out[19]:
In [20]:
# Works both ways
df["Market"][0:5]
Out[20]:
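The same selection can also be written with `.iloc` and `.loc`, which make the position-vs-label distinction explicit. A minimal sketch with a toy frame standing in for our scraped data:

```python
import pandas as pd

# A toy frame standing in for our scraped data
df = pd.DataFrame({'Market': ['A', 'B', 'C', 'D', 'E', 'F']})

# .iloc slices by position and excludes the right end, like df[0:5]
print(df.iloc[0:5]['Market'].tolist())

# .loc slices by label and includes both ends, so 0:4 gives five rows
print(df.loc[0:4, 'Market'].tolist())
```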
In [21]:
# Getting the unique values of Market
pd.unique(df['Market'])
Out[21]:
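pandas also exposes this as Series methods: `unique` gives the distinct values, `nunique` counts them, and `value_counts` tallies how often each occurs. A small sketch with a made-up Series:

```python
import pandas as pd

# A toy Series with a repeated market name
s = pd.Series(['LASALGAON', 'PIMPALGAON', 'LASALGAON'])

print(list(s.unique()))   # distinct values, in order of appearance
print(s.nunique())        # number of distinct values
print(s.value_counts())   # how often each value occurs
```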
In [ ]:
# Now read the full dataset from the saved MonthWiseMarketArrivals.html page
AllTable = pd.read_html('MonthWiseMarketArrivals.html', header = 0,
attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})
In [ ]:
AllTable[0].head()
In [ ]:
??pd.DataFrame.to_csv
In [ ]:
AllTable[0].columns
In [ ]:
# Change the column names to simpler ones
AllTable[0].columns = ['market', 'month', 'year', 'quantity', 'priceMin', 'priceMax', 'priceMod']
In [ ]:
AllTable[0].head()
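Assigning to `.columns` replaces all the names at once, so the list must cover every column. When you only want to change some of them, `rename` is the usual alternative; a minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Market': ['LASALGAON'], 'Year': [2016]})

# rename maps only the names you list and leaves the rest untouched
df = df.rename(columns={'Market': 'market'})
print(df.columns.tolist())
```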
In [ ]:
# Save the dataframe to a csv file
AllTable[0].to_csv('MonthWiseMarketArrivals.csv', index = False)
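As a quick sanity check, a dataframe saved with `index = False` should read back identically with `read_csv`. A minimal round-trip sketch, using a tiny stand-in dataframe and a hypothetical scratch filename:

```python
import pandas as pd

# A tiny stand-in with the same column names as our renamed dataframe
df = pd.DataFrame({
    'market': ['LASALGAON', 'BANGALORE'],
    'quantity': [100, 200],
    'priceMod': [700, 800],
})
df.to_csv('check.csv', index=False)   # hypothetical scratch filename

# Reading the file back should reproduce the dataframe exactly
df2 = pd.read_csv('check.csv')
print(df2.equals(df))
```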