2. Acquire the Data

Finding Data Sources

There are three place to get onion price and quantity information by market.

  1. Agmarket - This is the website run by the Directorate of Marketing & Inspection (DMI), Ministry of Agriculture, Government of India and provides daily price and arrival data for all agricultural commodities at national and state level. Unfortunately, the link to get Market-wise Daily Report for Specific Commodity (Onion for us) leads to a multipage aspx entry form to get data for each date. So it is like to require an involved scraper to get the data. Too much effort - Move on. Here is the best link to go to get what is available - http://agmarknet.nic.in/agnew/NationalBEnglish/SpecificCommodityWeeklyReport.aspx?ss=1
  1. Data.gov.in - This is normally a good place to get government data in a machine readable form like csv or xml. The Variety-wise Daily Market Prices Data of Onion is available for each year as an XML but unfortunately it does not include quantity information that is needed. It would be good to have both price and quantity - so even though this is easy, lets see if we can get both from a different source. Here is the best link to go to get what is available - https://data.gov.in/catalog/variety-wise-daily-market-prices-data-onion#web_catalog_tabs_block_10
  1. NHRDF - This is the website of National Horticultural Research & Development Foundation and maintains a database on Market Arrivals and Price, Area and Production and Export Data for three commodities - Garlic, Onion and Potatoes. We are in luck! It also has data from 1996 onwards and has only got one form to fill to get the data in a tabular form. Further it also has production and export data. Excellent. Lets use this. Here is the best link to got to get all that is available - http://nhrdf.org/en-us/DatabaseReports

Scraping the Data

Ways to Scrape Data

Now we can do this in two different levels of sophistication

  1. Automate the form filling process: The form on this page looks simple. But viewing source in the browser shows there form to fill with hidden fields and we will need to access it as a browser to get the session fields and then submit the form. This is a little bit more complicated than simple scraping a table on a webpage

  2. Manually fill the form: What if we manually fill the form with the desired form fields and then save the page as a html file. Then we can read this file and just scrape the table from it. Lets go with the simple way for now.

Scraping - Manual Form Filling

So let us fill the form to get a small subset of data and test our scraping process. We will start by getting the Monthwise Market Arrivals.

  • Crop Name: Onion
  • Month: January
  • Market: All
  • Year: 2016

The saved webpage is available at MonthWiseMarketArrivalsJan2016.html

Understand the HTML Structure

We need to scrape data from this html page... So let us try to understand the structure of the page.

  1. You can view the source of the page - typically Right Click and View Source on any browser and that would give your the source HTML for any page.

  2. You can open the developer tools in your browser and investigate the structure as you mouse over the page

  3. We can use a tools like Selector Gadget to understand the id's and classes' used in the web page

Our data is under the <table> tag

Exercise #1

Find the number of tables in the HTML Structure of MonthWiseMarketArrivalsJan2016.html?


In [ ]:

Find all the Tables


In [1]:
# Import the library we need, which is Pandas
import pandas as pd

In [2]:
# Read all the tables from the html document 
AllTables = pd.read_html('MonthWiseMarketArrivalsJan2016.html')

In [3]:
# Let us find out how many tables has it found?
len(AllTables)


Out[3]:
5

In [4]:
type(AllTables)


Out[4]:
list

Exercise #2

Find the exact table of data we want in the list of AllTables?


In [5]:
AllTables[4]


Out[5]:
0 1 2 3 4 5 6
0 Market Month Name Year Arrival (q) Price Minimum (Rs/q) Price Maximum (Rs/q) Modal Price (Rs/q)
1 AGRA(UP) January 2016 134200 1039 1443 1349
2 AHMEDABAD(GUJ) January 2016 198390 646 1224 997
3 AHMEDNAGAR(MS) January 2016 208751 175 1722 1138
4 AJMER(RAJ) January 2016 4247 722 1067 939
5 ALIGARH(UP) January 2016 12350 1219 1298 1257
6 ALWAR(RAJ) January 2016 9788 625 1200 912
7 AMRITSAR(PB) January 2016 24800 913 1308 1160
8 BALLIA(UP) January 2016 600 1400 1500 1460
9 BANGALORE January 2016 507223 200 1943 1448
10 BAREILLY(UP) January 2016 18435 1149 1149 1149
11 BELGAUM(KNT) January 2016 61564 485 1826 1164
12 BHATINDA(PB) January 2016 5510 1218 1764 1459
13 BHAVNAGAR(GUJ) January 2016 588870 767 1174 969
14 BHUBNESWER(OR) January 2016 34050 1551 1658 1619
15 BIJAPUR(KNT) January 2016 4110 413 1713 1088
16 BURDWAN(WB) January 2016 3880 1689 1789 1739
17 CHAKAN(MS) January 2016 41125 940 1680 1350
18 CHANDIGARH January 2016 4310 800 1200 1000
19 CHANDVAD(MS) January 2016 108619 461 1361 1094
20 CHENNAI January 2016 116700 1927 2245 2086
21 DEESA(GUJ) January 2016 784 900 1575 1350
22 DEHRADOON(UTT) January 2016 10068 808 1048 935
23 DELHI January 2016 249231 467 1573 1154
24 DEVALA(MS) January 2016 127136 828 1321 1075
25 DHAVANGERE(KNT) January 2016 13170 435 1545 990
26 DHULIA(MS) January 2016 98405 225 1268 976
27 GONDAL(GUJ) January 2016 153695 292 1061 760
28 GUWAHATI January 2016 3270 1641 1784 1713
29 HASSAN(KNT) January 2016 12211 782 1480 1255
... ... ... ... ... ... ... ...
55 MUMBAI January 2016 413681 654 1473 1215
56 NAGPUR January 2016 98645 954 1367 1260
57 NEWASA(MS) January 2016 296461 250 1758 1458
58 NIPHAD(MS) January 2016 33110 505 1108 958
59 PALAYAM(KER) January 2016 8100 1835 2106 1994
60 PATIALA(PB) January 2016 1445 1000 1575 1300
61 PATNA January 2016 29200 1440 1548 1497
62 PHALTAN (MS) January 2016 2662 433 1684 1117
63 PIMPALGAON(MS) January 2016 506740 403 1448 1004
64 PUNE(MS) January 2016 325669 613 1729 1488
65 PURULIA(WB) January 2016 1200 1640 1900 1800
66 RAHATA(MS) January 2016 149286 427 1986 1223
67 RAHURI(MS) January 2016 56957 200 1675 1038
68 RAICHUR(KNT) January 2016 17795 987 1590 1331
69 RAIPUR(CHGARH) January 2016 4300 1193 1371 1299
70 RAJKOT(GUJ) January 2016 80400 389 980 783
71 SAGAR(MP) January 2016 1954 792 1000 892
72 SAIKHEDA(MS) January 2016 106742 375 1260 898
73 SANGALI(MS) January 2016 16104 413 2125 1231
74 SANGAMNER(MS) January 2016 117471 500 1853 1312
75 SATANA(MS) January 2016 69286 429 1354 957
76 SHRIRAMPUR(MS) January 2016 23497 271 1786 1350
77 SINNAR(MS) January 2016 73933 160 1363 988
78 SOLAPUR(MS) January 2016 403797 124 2285 958
79 SURAT(GUJ) January 2016 31700 943 1562 1252
80 UDAIPUR(RAJ) January 2016 6456 386 1307 846
81 VANI(MS) January 2016 60983 767 1323 1007
82 VARANASI(UP) January 2016 28900 1460 1503 1484
83 YEOLA(MS) January 2016 437432 437 1272 1034
84 NaN NaN Total 9307923 751(Avg) 1490(Avg) 1186(Avg)

85 rows × 7 columns

Get the exact table

To read the exact table we need to pass in an identifier value which would identify the table. We can use the attrs parameter in read_html to do so. The parameter we will pass is the id variable


In [6]:
# So can we read our exact table
OneTable = pd.read_html('MonthWiseMarketArrivalsJan2016.html', 
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

In [7]:
# So how many tables have we got now
len(OneTable)


Out[7]:
1

In [8]:
# Show the table of data identifed by pandas with just the first five rows
OneTable[0].head()


Out[8]:
0 1 2 3 4 5 6
0 Market Month Name Year Arrival (q) Price Minimum (Rs/q) Price Maximum (Rs/q) Modal Price (Rs/q)
1 AGRA(UP) January 2016 134200 1039 1443 1349
2 AHMEDABAD(GUJ) January 2016 198390 646 1224 997
3 AHMEDNAGAR(MS) January 2016 208751 175 1722 1138
4 AJMER(RAJ) January 2016 4247 722 1067 939

However, we have not got the header correctly in our dataframe. Let us see if we can fix this.

To get help on any function just use ?? before the function to help. Run this function and see what additional parameter you need to define to get the header correctly


In [ ]:
??pd.read_html

Exercise #3

Read the html file again and ensure that the correct header is identifed by pandas?


In [11]:
OneTable = pd.read_html('MonthWiseMarketArrivalsJan2016.html', header = 0,
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

Show the top five rows of the dataframe you have read to ensure the headers are now correct.


In [10]:
OneTable[0].head()


Out[10]:
Market Month Name Year Arrival (q) Price Minimum (Rs/q) Price Maximum (Rs/q) Modal Price (Rs/q)
0 AGRA(UP) January 2016 134200 1039 1443 1349
1 AHMEDABAD(GUJ) January 2016 198390 646 1224 997
2 AHMEDNAGAR(MS) January 2016 208751 175 1722 1138
3 AJMER(RAJ) January 2016 4247 722 1067 939
4 ALIGARH(UP) January 2016 12350 1219 1298 1257

Dataframe Viewing


In [12]:
# Let us store the dataframe in a df variable. You will see that as a very common convention in data science pandas use
df = OneTable[0]

In [13]:
# Shape of the dateset - number of rows & number of columns in the dataframe
df.shape


Out[13]:
(84, 7)

In [14]:
# Get the names of all the columns 
df.columns


Out[14]:
Index(['Market', 'Month Name', 'Year', 'Arrival (q)', 'Price Minimum (Rs/q)',
       'Price Maximum (Rs/q)', 'Modal Price (Rs/q)'],
      dtype='object')

In [15]:
# Can we see sample rows - the top 5 rows
df.head()


Out[15]:
Market Month Name Year Arrival (q) Price Minimum (Rs/q) Price Maximum (Rs/q) Modal Price (Rs/q)
0 AGRA(UP) January 2016 134200 1039 1443 1349
1 AHMEDABAD(GUJ) January 2016 198390 646 1224 997
2 AHMEDNAGAR(MS) January 2016 208751 175 1722 1138
3 AJMER(RAJ) January 2016 4247 722 1067 939
4 ALIGARH(UP) January 2016 12350 1219 1298 1257

In [16]:
# Can we see sample rows - the bottom 5 rows
df.tail()


Out[16]:
Market Month Name Year Arrival (q) Price Minimum (Rs/q) Price Maximum (Rs/q) Modal Price (Rs/q)
79 UDAIPUR(RAJ) January 2016 6456 386 1307 846
80 VANI(MS) January 2016 60983 767 1323 1007
81 VARANASI(UP) January 2016 28900 1460 1503 1484
82 YEOLA(MS) January 2016 437432 437 1272 1034
83 NaN NaN Total 9307923 751(Avg) 1490(Avg) 1186(Avg)

In [17]:
# Can we access a specific columns
df["Market"]


Out[17]:
0            AGRA(UP)
1      AHMEDABAD(GUJ)
2      AHMEDNAGAR(MS)
3          AJMER(RAJ)
4         ALIGARH(UP)
5          ALWAR(RAJ)
6        AMRITSAR(PB)
7          BALLIA(UP)
8           BANGALORE
9        BAREILLY(UP)
10       BELGAUM(KNT)
11       BHATINDA(PB)
12     BHAVNAGAR(GUJ)
13     BHUBNESWER(OR)
14       BIJAPUR(KNT)
15        BURDWAN(WB)
16         CHAKAN(MS)
17         CHANDIGARH
18       CHANDVAD(MS)
19            CHENNAI
20         DEESA(GUJ)
21     DEHRADOON(UTT)
22              DELHI
23         DEVALA(MS)
24    DHAVANGERE(KNT)
25         DHULIA(MS)
26        GONDAL(GUJ)
27           GUWAHATI
28        HASSAN(KNT)
29     HOSHIARPUR(PB)
           ...       
54             MUMBAI
55             NAGPUR
56         NEWASA(MS)
57         NIPHAD(MS)
58       PALAYAM(KER)
59        PATIALA(PB)
60              PATNA
61       PHALTAN (MS)
62     PIMPALGAON(MS)
63           PUNE(MS)
64        PURULIA(WB)
65         RAHATA(MS)
66         RAHURI(MS)
67       RAICHUR(KNT)
68     RAIPUR(CHGARH)
69        RAJKOT(GUJ)
70          SAGAR(MP)
71       SAIKHEDA(MS)
72        SANGALI(MS)
73      SANGAMNER(MS)
74         SATANA(MS)
75     SHRIRAMPUR(MS)
76         SINNAR(MS)
77        SOLAPUR(MS)
78         SURAT(GUJ)
79       UDAIPUR(RAJ)
80           VANI(MS)
81       VARANASI(UP)
82          YEOLA(MS)
83                NaN
Name: Market, dtype: object

In [18]:
# Using the dot notation
df.Market


Out[18]:
0            AGRA(UP)
1      AHMEDABAD(GUJ)
2      AHMEDNAGAR(MS)
3          AJMER(RAJ)
4         ALIGARH(UP)
5          ALWAR(RAJ)
6        AMRITSAR(PB)
7          BALLIA(UP)
8           BANGALORE
9        BAREILLY(UP)
10       BELGAUM(KNT)
11       BHATINDA(PB)
12     BHAVNAGAR(GUJ)
13     BHUBNESWER(OR)
14       BIJAPUR(KNT)
15        BURDWAN(WB)
16         CHAKAN(MS)
17         CHANDIGARH
18       CHANDVAD(MS)
19            CHENNAI
20         DEESA(GUJ)
21     DEHRADOON(UTT)
22              DELHI
23         DEVALA(MS)
24    DHAVANGERE(KNT)
25         DHULIA(MS)
26        GONDAL(GUJ)
27           GUWAHATI
28        HASSAN(KNT)
29     HOSHIARPUR(PB)
           ...       
54             MUMBAI
55             NAGPUR
56         NEWASA(MS)
57         NIPHAD(MS)
58       PALAYAM(KER)
59        PATIALA(PB)
60              PATNA
61       PHALTAN (MS)
62     PIMPALGAON(MS)
63           PUNE(MS)
64        PURULIA(WB)
65         RAHATA(MS)
66         RAHURI(MS)
67       RAICHUR(KNT)
68     RAIPUR(CHGARH)
69        RAJKOT(GUJ)
70          SAGAR(MP)
71       SAIKHEDA(MS)
72        SANGALI(MS)
73      SANGAMNER(MS)
74         SATANA(MS)
75     SHRIRAMPUR(MS)
76         SINNAR(MS)
77        SOLAPUR(MS)
78         SURAT(GUJ)
79       UDAIPUR(RAJ)
80           VANI(MS)
81       VARANASI(UP)
82          YEOLA(MS)
83                NaN
Name: Market, dtype: object

In [19]:
# Selecting specific column and rows
df[0:5]["Market"]


Out[19]:
0          AGRA(UP)
1    AHMEDABAD(GUJ)
2    AHMEDNAGAR(MS)
3        AJMER(RAJ)
4       ALIGARH(UP)
Name: Market, dtype: object

In [20]:
# Works both ways
df["Market"][0:5]


Out[20]:
0          AGRA(UP)
1    AHMEDABAD(GUJ)
2    AHMEDNAGAR(MS)
3        AJMER(RAJ)
4       ALIGARH(UP)
Name: Market, dtype: object

In [21]:
#Getting unique values of State
pd.unique(df['Market'])


Out[21]:
array(['AGRA(UP)', 'AHMEDABAD(GUJ)', 'AHMEDNAGAR(MS)', 'AJMER(RAJ)',
       'ALIGARH(UP)', 'ALWAR(RAJ)', 'AMRITSAR(PB)', 'BALLIA(UP)',
       'BANGALORE', 'BAREILLY(UP)', 'BELGAUM(KNT)', 'BHATINDA(PB)',
       'BHAVNAGAR(GUJ)', 'BHUBNESWER(OR)', 'BIJAPUR(KNT)', 'BURDWAN(WB)',
       'CHAKAN(MS)', 'CHANDIGARH', 'CHANDVAD(MS)', 'CHENNAI', 'DEESA(GUJ)',
       'DEHRADOON(UTT)', 'DELHI', 'DEVALA(MS)', 'DHAVANGERE(KNT)',
       'DHULIA(MS)', 'GONDAL(GUJ)', 'GUWAHATI', 'HASSAN(KNT)',
       'HOSHIARPUR(PB)', 'HUBLI(KNT)', 'HYDERABAD', 'INDORE(MP)', 'JAIPUR',
       'JALANDHAR(PB)', 'JALGAON(MS)', 'JAMMU', 'JAMNAGAR(GUJ)',
       'JODHPUR(RAJ)', 'KALVAN(MS)', 'KANPUR(UP)', 'KARNAL(HR)',
       'KHANNA(PB)', 'KOLHAPUR(MS)', 'KOLKATA', 'KOPERGAON(MS)',
       'KOTA(RAJ)', 'KURNOOL(AP)', 'LASALGAON(MS)', 'LONAND(MS)',
       'LUCKNOW', 'MAHUVA(GUJ)', 'MALEGAON(MS)', 'MANMAD(MS)', 'MUMBAI',
       'NAGPUR', 'NEWASA(MS)', 'NIPHAD(MS)', 'PALAYAM(KER)', 'PATIALA(PB)',
       'PATNA', 'PHALTAN (MS)', 'PIMPALGAON(MS)', 'PUNE(MS)',
       'PURULIA(WB)', 'RAHATA(MS)', 'RAHURI(MS)', 'RAICHUR(KNT)',
       'RAIPUR(CHGARH)', 'RAJKOT(GUJ)', 'SAGAR(MP)', 'SAIKHEDA(MS)',
       'SANGALI(MS)', 'SANGAMNER(MS)', 'SATANA(MS)', 'SHRIRAMPUR(MS)',
       'SINNAR(MS)', 'SOLAPUR(MS)', 'SURAT(GUJ)', 'UDAIPUR(RAJ)',
       'VANI(MS)', 'VARANASI(UP)', 'YEOLA(MS)', nan], dtype=object)

Downloading the Entire Month Wise Arrival Data


In [ ]:
AllTable = pd.read_html('MonthWiseMarketArrivals.html', header = 0,
                      attrs = {'id' : 'dnn_ctr974_MonthWiseMarketArrivals_GridView1'})

In [ ]:
AllTable[0].head()

In [ ]:
??pd.DataFrame.to_csv

In [ ]:
AllTable[0].columns

In [ ]:
# Change the column names to simpler ones
AllTable[0].columns = ['market', 'month', 'year', 'quantity', 'priceMin', 'priceMax', 'priceMod']

In [ ]:
AllTable[0].head()

In [ ]:
# Save the dataframe to a csv file
AllTable[0].to_csv('MonthWiseMarketArrivals.csv', index = False)